Kip 368 (SASL reauthentication) #13822

Merged: 9 commits, Nov 4, 2023

@oleiman oleiman commented Sep 29, 2023

This PR implements most of KIP-368.

This PR also bumps kafka client ducktape dependencies to a version that supports reauthentication.

Fixes https://github.com/redpanda-data/core-internal/issues/775

TODO:

  • The KIP describes some metrics related to reauthentication; these could also help with testing.
  • OAUTHBEARER support in general is WIP - some additional logic is required here to feed token expiration time into session lifetime calculation. Implementation and tests (based on POC OIDC support) are at the head of this branch
  • Test Kerberos/GSSAPI
    • CI not currently supported. Needs a manual smoke test. Results to follow
    • Update: Some challenges here:
      • Had some issues setting up ADDS last week - as yet unresolved
      • Ducktape/kerberos bits are working fine, but we don't have a client available that supports both GSSAPI and kip-368

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • Add broker support for SASL reauthentication. To enable, set connections_max_reauth_ms to a value greater than 0.
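
As a rough illustration of the semantics in the release note and the OAUTHBEARER TODO above, here is a minimal Python sketch of how a session lifetime could be derived. The function name `session_lifetime_ms` and the `token_remaining_ms` parameter are hypothetical, and the "lesser of the two" rule follows KIP-368 as described for Kafka; it is an assumption here, not the code in this PR.

```python
from typing import Optional


def session_lifetime_ms(max_reauth_ms: int,
                        token_remaining_ms: Optional[int] = None) -> Optional[int]:
    """Hypothetical helper. Returns None when sessions never expire."""
    if max_reauth_ms < 0:
        raise ValueError("max_reauth_ms must be >= 0")
    if max_reauth_ms == 0:
        # A value of 0 disables reauthentication entirely.
        return None
    if token_remaining_ms is None:
        # Mechanisms without an expiring credential (assumed: e.g. SCRAM).
        return max_reauth_ms
    # Assumed OAUTHBEARER rule: the session cannot outlive the token.
    return max(0, min(max_reauth_ms, token_remaining_ms))
```

With these assumptions, a 30 s limit combined with a token that has 10 s left yields a 10 s session, while a limit of 0 yields no expiry at all.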

@oleiman oleiman requested a review from a team as a code owner September 29, 2023 18:03
@oleiman oleiman requested review from andrewhsu and removed request for a team September 29, 2023 18:03
@oleiman oleiman marked this pull request as draft September 29, 2023 18:03
@oleiman oleiman self-assigned this Sep 29, 2023

@oleiman oleiman marked this pull request as ready for review September 30, 2023 00:30

@dotnwat dotnwat left a comment

lgtm, just some minor comments.

@@ -2513,7 +2513,14 @@ configuration::configuration()
 "The sample period for the CPU profiler",
 {.needs_restart = needs_restart::no, .visibility = visibility::user},
 100ms,
-{.min = 1ms}) {}
+{.min = 1ms})
+, connections_max_reauth_ms(
Member

  • Should this be kafka_conn... ?
  • Is this specific to SASL in a way that makes sense to express in the name?

Member Author

Named to match the Kafka equivalent, but the more I poke around, the less that seems necessary or expected. It is indeed specific to SASL, so maybe something like kafka_sasl_max_reauth_ms.

, connections_max_reauth_ms(
*this,
"connections_max_reauth_ms",
"If gtz, the maximum time between client reauthentication",
Member

Let's write this out like we might expect to see in documentation (e.g. avoiding shorthand like gtz).

"The maximum time between Kafka client reauthentication. <maybe a short sentence about what/why>. Reauthentication is disabled if the value is 0".

"If gtz, the maximum time between client reauthentication",
{.needs_restart = needs_restart::yes, .visibility = visibility::user},
0ms,
{.min = 0ms}) {}
Member

It's also common to use std::nullopt to disable a feature, but I don't have a strong opinion about it.

Member Author

Again, sort of aping the kafka version. I think null is more clear in this case though.

Member

> Again, sort of aping the kafka version. I think null is more clear in this case though.

ahh yeh makes sense. sometimes i go ping product and get their opinion on stuff like this.

Comment on lines 125 to 128
vlog(
klog.trace,
"SASL session_lifetime_ms: {}",
sasl() ? sasl()->session_lifetime_ms() : -1);
Member

since this is a connection-oriented option, is it necessary to log it on its own at per-request granularity? perhaps there is another request-level trace message we could expand? no specific suggestion right now--maybe this is the best we can do.

Member Author

Convenient during development as a countdown, but in retrospect I don't think we need a trace log for this at all.

@@ -249,6 +249,14 @@ process_result_stages process_request(
request_context&& ctx,
ss::smp_service_group g,
const session_resources& sres) {
auto& key = ctx.header().key;
Member

no need for a mutable reference here; key is a small integer, so just grab a copy?

unlikely(ctx.sasl() && ctx.sasl()->complete() && ctx.sasl()->expired())
&& key != sasl_handshake_handler::api::key
&& key != sasl_authenticate_handler::api::key) {
throw sasl_session_expired_exception("Session expired");
Member

if the information is handy, it'd probably be worth adding some context to this, like how long the expiration was set to, and maybe the client id if it isn't already logged at a higher level (or maybe a higher level is where that extra context should go).

Member Author

👍

sasl_authenticate_response_data data{
.error_code = error_code::none,
.error_message = std::nullopt,
.auth_bytes = std::move(result.value()),
.session_lifetime_ms = ctx.sasl()->session_lifetime_ms(),
Member

it's generally preferred to keep std::chrono types as long as possible, so consider having session_lifetime_ms return a std::chrono::duration and then applying the serialization logic for the response here (i.e. std::chrono::duration_cast<ms>(session_lifetime()).count()).

Member Author

👍

 // Only initialise sasl state if sasl is enabled
 auto sasl = authn_method == config::broker_authn_method::sasl
 ? std::make_optional<security::sasl_server>(
-security::sasl_server::sasl_state::initial)
+security::sasl_server::sasl_state::initial,
+conn_max_reauth_ms)
Member

an argument for making this max parameter a config::binding<ms> is that, since the configuration is locked in when the connection is created, changes to the max reauth only take effect for new connections. it's probably not a big deal, but in principle someone might want to lower the max reauth but wouldn't have a way to forcefully shut down a client connection without restarting redpanda.

Member Author

Config is currently set up to require a restart, but again that's just to match what kafka offers. I figure the reasoning there is that a live misconfiguration could be fairly disruptive to connected clients, but the scenario you're describing makes sense from a usability standpoint. Not sure about this one.

Member

> Not sure about this one.

I think it is probably ok as-is. What I'd do is just give a little thought to the possible scenarios that might occur and if any of those scenarios have the property that we wish it were live updatable in the context of an incident.
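
To make the trade-off in this thread concrete, here is a small Python sketch contrasting a value copied at connection creation with one re-read through a live binding. All class names are hypothetical stand-ins; `BoundSession` only loosely models what a `config::binding<ms>` would provide and is not Redpanda's API.

```python
class ClusterConfig:
    """Hypothetical stand-in for a live cluster configuration."""

    def __init__(self, max_reauth_ms: int) -> None:
        self.max_reauth_ms = max_reauth_ms


class SnapshotSession:
    """Copies the limit once, when the connection is created."""

    def __init__(self, cfg: ClusterConfig) -> None:
        self._max_reauth_ms = cfg.max_reauth_ms

    def max_reauth_ms(self) -> int:
        return self._max_reauth_ms


class BoundSession:
    """Keeps a reference and re-reads the limit on every check."""

    def __init__(self, cfg: ClusterConfig) -> None:
        self._cfg = cfg

    def max_reauth_ms(self) -> int:
        return self._cfg.max_reauth_ms


cfg = ClusterConfig(max_reauth_ms=60_000)
snapshot, bound = SnapshotSession(cfg), BoundSession(cfg)
cfg.max_reauth_ms = 10_000  # operator lowers the limit at runtime
```

After the update, `snapshot` still reports the old 60 s limit while `bound` sees the new 10 s limit, which is the difference an operator would notice on existing connections.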

Comment on lines 261 to 263
// This is a client-driven reauthentication
if (unlikely(
ctx.sasl() && ctx.sasl()->complete()
&& key == sasl_handshake_handler::api::key)) {
vlog(
klog.debug,
"SASL reauthentication detected - resetting authn server");
ctx.sasl()->reset();
}
Member

could combine this with the conditional above?

if (sasl.complete()) {
    if (handshake()) {
        reset();
    } else if (expired && !authenticate()) {
        ....
    }
}

Member Author

True. Had a separate state in a previous iteration I think.
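
The combined conditional suggested above can be modelled as a small decision function. This is a Python sketch of the control flow only; the `Action` enum and the `classify` function are hypothetical names for illustration, not code from the PR.

```python
from enum import Enum, auto


class Action(Enum):
    PROCESS = auto()         # handle the request normally
    RESET = auto()           # client-driven reauthentication: reset SASL state
    REJECT_EXPIRED = auto()  # session expired: refuse the request


def classify(complete: bool, expired: bool,
             is_handshake: bool, is_authenticate: bool) -> Action:
    """Single branch structure covering both original checks."""
    if complete:
        if is_handshake:
            return Action.RESET
        if expired and not is_authenticate:
            return Action.REJECT_EXPIRED
    return Action.PROCESS
```

A handshake on a completed session always resets, even after expiry, so an expired client can still start reauthentication; any other request on an expired session is rejected unless it is a SASL authenticate request.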

Comment on lines 72 to 140
assert (
    any(reauths[n.name] > 0 for n in self.redpanda.nodes)
), f"Expected client reauth on some broker...Reauths: {json.dumps(reauths, indent=1)}"

exps = {}
for node in self.redpanda.nodes:
    exps[node.name] = self.redpanda.count_log_node(
        node, "Session expired")

assert (
    all(exps[n.name] == 0 for n in self.redpanda.nodes)
), f"Client should reauth before session expiry...Expirations: {json.dumps(exps, indent=1)}"
Member

Are these two separate tests, in that exps could be empty because full session expiry and connection shutdown isn't tested explicitly, so the final assert condition is trivially true?

Member Author

@oleiman oleiman Oct 2, 2023

The way this works is that the response to SASLAuthenticateRequest carries a lifetime field for the time to session expiry, then librdkafka will always try to reauthenticate once 90% of that lifetime has elapsed.

So this is asserting (albeit obliquely) that the broker didn't lie about session expiry and didn't tear things down prematurely.

I don't see an obvious way to explicitly test this type of connection shutdown, since configuring connections_max_reauth_ms --> lifetime_ms in response --> client proactively reauths @ T + 0.9 * lifetime_ms.
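
The timing argument above can be written out as simple arithmetic. This is a Python sketch under the 90% figure quoted in this comment; the function names and the `reauth_percent` parameter are hypothetical and are not librdkafka settings.

```python
def broker_expiry_ms(auth_done_ms: int, lifetime_ms: int) -> int:
    """Time at which the broker considers the session expired."""
    return auth_done_ms + lifetime_ms


def client_reauth_at_ms(auth_done_ms: int, lifetime_ms: int,
                        reauth_percent: int = 90) -> int:
    """Time at which the client proactively reauthenticates."""
    return auth_done_ms + lifetime_ms * reauth_percent // 100


def margin_ms(auth_done_ms: int, lifetime_ms: int) -> int:
    """How long before expiry the client is expected to reauthenticate."""
    return (broker_expiry_ms(auth_done_ms, lifetime_ms)
            - client_reauth_at_ms(auth_done_ms, lifetime_ms))
```

For a 10 s lifetime the client reauthenticates 1 s before the broker would expire the session, so a well-behaved client should never trigger a "Session expired" log line, which is what the test asserts.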

oleiman commented Oct 2, 2023

Thanks for the review @dotnwat. Several decisions in this PR were made for compatibility with the corresponding kafka config (naming, semantics, etc). I may have made that up as a goal though...can you confirm one way or the other? In a general sense, not necessarily for this feature.

dotnwat commented Oct 5, 2023

> were made for compatibility with the corresponding kafka config (naming, semantics, etc)

I think this is a valid reason for making decisions most of the time.

@oleiman oleiman force-pushed the kip-368 branch 3 times, most recently from be39aa6 to 9cac240 Compare October 10, 2023 22:35
oleiman commented Oct 10, 2023

force push to add a kerberos test

oleiman commented Oct 11, 2023

force push to take up ducktape version revert (merge conflict)

@vbotbuildovich

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40256#018b8c90-a295-4926-ad3a-480972393398: "rptest.tests.node_pool_migration_test.NodePoolMigrationTest.test_migrating_redpanda_nodes_to_new_pool.balancing_mode=off"
"rptest.tests.upgrade_test.UpgradeFromPriorFeatureVersionTest.test_basic_upgrade"
"rptest.tests.upgrade_test.RedpandaInstallerTest.test_install_by_line"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_and_upgrade"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_finishes_after_manual_cancellation.delete_topic=True"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=False.tick_interval=5000"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_flipping_decommission_recommission.node_is_alive=True"
"rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade_with_rollback.upgrade_after_rollback=True"
"rptest.tests.cluster_features_test.FeaturesNodeJoinTest.test_synthetic_too_new_node_join"
"rptest.tests.controller_snapshot_test.ControllerSnapshotTest.test_upgrade_compat"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_deletion_stops_move.num_to_upgrade=2"
"rptest.tests.raft_recovery_test.RaftRecoveryUpgradeTest.test_upgrade"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_after_upgrade.empty_seed_starts_cluster=True"
"rptest.tests.cluster_features_test.FeaturesMultiNodeUpgradeTest.test_rollback"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default.wipe_cache=True"
"rptest.tests.controller_snapshot_test.ControllerSnapshotPolicyTest.test_upgrade_auto_enable"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_moving_not_fully_initialized_partition.num_to_upgrade=2"

@vbotbuildovich

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40256#018b8c90-a28c-486b-b5c0-fc9042843fd3: "rptest.tests.cluster_features_test.FeaturesSingleNodeTest.test_get_features"
"rptest.tests.read_replica_e2e_test.ReadReplicasUpgradeTest.test_upgrades.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.controller_upgrade_test.ControllerUpgradeTest.test_updating_cluster_when_executing_operations"
"rptest.tests.memory_stress_test.MemoryStressTest.test_fetch_with_many_partitions.memory_share_for_fetch=0.5"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_move_consumer_offsets_intranode.num_to_upgrade=2"
"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=2.cloud_storage_type=CloudStorageType.ABS"
"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=2.cloud_storage_type=CloudStorageType.ABS"
"rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.num_to_upgrade=0.with_tiered_storage=False"
"rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.with_tiered_storage=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_cancel_ongoing_movements"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_rebalancing_node.shutdown_decommissioned=False"
"rptest.tests.acls_test.AccessControlListTestUpgrade.test_upgrade_sasl"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=True.tick_interval=3600000"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_multiple_decommissions"
"rptest.tests.nodes_decommissioning_test.NodeDecommissionFailureReportingTest.test_allocation_failure_reporting"
"rptest.tests.node_folder_deletion_test.NodeFolderDeletionTest.test_deleting_node_folder"
"rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=False"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_during_upgrade.empty_seed_starts_cluster=False"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default_explicit_after_upgrade.wipe_cache=False"
"rptest.tests.cluster_features_test.FeaturesMultiNodeTest.test_explicit_activation"
"rptest.tests.cluster_features_test.FeaturesMultiNodeUpgradeTest.test_upgrade"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_empty.num_to_upgrade=2"
"rptest.tests.upgrade_test.UpgradeFromSpecificVersion.test_basic_upgrade"

@vbotbuildovich

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40256#018b8c90-a28f-41e1-80c0-2e4af4861d18: "rptest.tests.cluster_features_test.FeaturesSingleNodeUpgradeTest.test_upgrade"
"rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=False.num_to_upgrade=0.with_tiered_storage=True"
"rptest.tests.random_node_operations_test.RandomNodeOperationsTest.test_node_operations.enable_failures=True.num_to_upgrade=0.with_tiered_storage=True"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_working_node.delete_topic=True.tick_interval=5000"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_crashed_node"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_decommissioning_rebalancing_node.shutdown_decommissioned=True"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_recommissioning_node_finishes"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_bootstrapping_after_move.num_to_upgrade=2"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_overlapping_changes.num_to_upgrade=2"
"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=2.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=2.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.nodes_decommissioning_test.NodeDecommissionSpaceManagementTest.test_decommission.single_partition=False"
"rptest.tests.nodes_decommissioning_test.NodesDecommissioningTest.test_node_is_not_allowed_to_join_after_restart.new_bootstrap=False"
"rptest.tests.upgrade_test.UpgradeWithWorkloadTest.test_rolling_upgrade"
"rptest.tests.cluster_features_test.FeaturesNodeJoinTest.test_old_node_join"
"rptest.tests.partition_movement_test.PartitionMovementTest.test_static.num_to_upgrade=2"
"rptest.tests.upgrade_test.UpgradeBackToBackTest.test_upgrade_with_all_workloads.single_upgrade=True"
"rptest.tests.workload_upgrade_runner_test.RedpandaUpgradeTest.test_workloads_through_releases.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.cluster_bootstrap_test.ClusterBootstrapUpgrade.test_change_bootstrap_configs_during_upgrade.empty_seed_starts_cluster=True"
"rptest.tests.cluster_features_test.FeaturesMultiNodeTest.test_get_features"
"rptest.tests.cluster_config_test.ClusterConfigLegacyDefaultTest.test_legacy_default_explicit_after_upgrade.wipe_cache=True"
"rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade"
"rptest.tests.license_upgrade_test.UpgradeToLicenseChecks.test_basic_upgrade.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.pandaproxy_test.BasicAuthUpgradeTest.test_upgrade_and_enable_basic_auth.base_release=.22.2.next_release=.22.3"
"rptest.tests.offset_retention_test.OffsetRetentionDisabledAfterUpgrade.test_upgrade_from_pre_v23.initial_version=.22.2.9"
"rptest.tests.topic_creation_test.CreateTopicUpgradeTest.test_retention_config_on_upgrade_from_v22_2_to_v22_3.cloud_storage_type=CloudStorageType.S3"
"rptest.tests.transactions_test.GATransaction_v22_1_UpgradeTest.upgrade_coordinator_test"


@BenPope BenPope left a comment

Looks great!

src/v/kafka/server/connection_context.cc (resolved)
tests/rptest/tests/redpanda_oauth_test.py (outdated, resolved)
tests/rptest/tests/scram_test.py (resolved)
BenPope previously approved these changes Nov 2, 2023
oleiman commented Nov 2, 2023

force push for a last bit of cleanup in ducktape tests

BenPope previously approved these changes Nov 2, 2023

@BenPope BenPope left a comment

LGTM

@vbotbuildovich

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40323#018b9151-268c-4172-8844-d7635d91f304: "rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=delete"
"rptest.tests.redpanda_oauth_test.OIDCReauthTest.test_oidc_reauth"

SCRAM, GSSAPI, OIDC

Reauth tests are added alongside other integration tests for the
associated authn mechanism. sasl_reauth_test.py contains tests more
generic to reauth configuration along with some utility functions
for accessing sasl metrics.

This commit also adds a `KrbClient.produce`, which sticks a little
bit of python onto the Kerberos node and executes it. This bit of
python configures a GSSAPI-authenticated producer and publishes a
specified number of records.

This commit also makes some minor changes to KeycloakService and
PythonLibrdkafka to account for token lifetime (configuration and
challenge tracking).

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>

oleiman commented Nov 2, 2023

force push for OIDC test timing. Not necessarily fixed; needs a longer repeat run

@vbotbuildovich

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40345#018b9223-3b08-4c15-b08e-88bff54fade0: "rptest.tests.e2e_shadow_indexing_test.EndToEndShadowIndexingTest.test_recover_after_delete_records"

@vbotbuildovich

new failures detected in https://buildkite.com/redpanda/redpanda/builds/40345#018b9223-3b0a-4b91-a8e4-d0bd7e2a4b4e: "rptest.tests.memory_stress_test.MemoryStressTest.test_fetch_with_many_partitions.memory_share_for_fetch=0.7"

oleiman commented Nov 3, 2023

/ci-repeat 1
tests/rptest/tests/redpanda_oauth_test.py::OIDCReauthTest.test_oidc_reauth
dt-repeat=100
skip-units

oleiman commented Nov 3, 2023

/ci-repeat 1
dt-repeat=5
skip-units
skip-redpanda-build

@piyushredpanda

/ci-repeat 1

@piyushredpanda piyushredpanda merged commit c04485d into redpanda-data:dev Nov 4, 2023
24 checks passed