
CORE-2752 - Fix Kafka quota throttling delay enforcement #18218

Merged

Conversation


@pgellert pgellert commented May 2, 2024

Quota enforcement is currently done in a complicated and
not-Kafka-compatible way in Redpanda, and this PR intends to fix
that.

In Kafka >=2.0, client throttling is implemented in a simple way.
Brokers return how long the client is supposed to be throttled when
there is a quota violation. The Kafka client is then supposed to wait
until this throttle time has passed, or else the broker applies the
throttle on its side.
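
To make the contract concrete, here is a minimal client-side sketch in C++ (hypothetical names; real clients such as librdkafka handle this internally). The broker advertises the delay in the response's `throttle_time_ms` field, and a well-behaved client pauses for that long before sending more requests:

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

struct response_header {
    int32_t throttle_time_ms; // broker-advertised delay; 0 when not throttled
};

void after_response(const response_header& hdr) {
    if (hdr.throttle_time_ms > 0) {
        // A well-behaved client honours the delay itself; otherwise the
        // broker applies the throttle on its side on the next request.
        std::this_thread::sleep_for(
          std::chrono::milliseconds(hdr.throttle_time_ms));
    }
}

int main() {
    after_response(response_header{150}); // e.g. a 150ms advertised throttle
}
```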

However, currently in Redpanda the delay is enforced differently. For
produce, we enforce the throttle we would send in the response if there
was a throttle in the last response (regardless of whether the client
obeyed that throttle). For fetch, we always enforce the current
throttle. Only for ingress/egress quotas do we correctly track how long
the client was supposed to be throttled.

This fixes the throttling behaviour by tracking how long the
client is supposed to have waited and applying that throttle on the next
request if the client did not wait it out itself.
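
As a rough illustration of the fixed approach (a sketch with assumed names, not the actual connection_context code): the broker remembers until when the client should have stayed quiet, and on the next request enforces only the part of the delay the client failed to wait out:

```cpp
#include <algorithm>
#include <chrono>

using sclock = std::chrono::steady_clock;
using std::chrono::milliseconds;

struct connection_throttle {
    sclock::time_point throttled_until{};

    // Record the delay advertised in a response; keep the furthest deadline.
    void record(milliseconds delay) {
        throttled_until = std::max(throttled_until, sclock::now() + delay);
    }

    // On the next request, enforce only what the client has not already
    // waited out on its side.
    milliseconds residual() const {
        auto now = sclock::now();
        if (now >= throttled_until) {
            return milliseconds{0}; // client honoured the delay, or none pending
        }
        return std::chrono::duration_cast<milliseconds>(throttled_until - now);
    }
};

int main() {
    connection_throttle ct;
    ct.record(milliseconds{200}); // previous response carried throttle_time_ms=200
    // If the next request arrives immediately, the broker delays it ~200ms;
    // if the client already waited 200ms, residual() is zero.
    auto delay = ct.residual();
    (void)delay;
}
```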

Fixes https://redpandadata.atlassian.net/browse/CORE-2752

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Enforce client quota throttling in a Kafka-compatible way, meaning we enforce the throttle delay on the next request if the client did not enforce it on its side.

@pgellert pgellert requested a review from a team May 2, 2024 13:07
@pgellert pgellert self-assigned this May 2, 2024
@pgellert pgellert requested review from michael-redpanda and removed request for a team May 2, 2024 13:07
src/v/kafka/server/snc_quota_manager.cc (review thread, outdated, resolved)
src/v/kafka/server/connection_context.h (review thread, outdated, resolved)
@pgellert pgellert force-pushed the quotas/fix-throttling-delay-enforcement branch from 64df46a to ebb9af3 Compare May 7, 2024 12:35

pgellert commented May 7, 2024

Force-pushed to remove the wrapper struct from connection_context and to fix exempt client handling in snc_quota_manager as per code review above.

@pgellert pgellert requested a review from BenPope May 7, 2024 12:46

vbotbuildovich commented May 7, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/48783#018f5349-7685-4a4a-aacf-31045b60fda3:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/48783#018f5349-7688-4e0e-94b5-2edf9064b89c:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48783#018f5352-1312-489e-8aa1-fc7ad40c43b4:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/48783#018f5352-130f-4450-8ab3-0eebaa089785:

"rptest.tests.e2e_shadow_indexing_test.EndToEndThrottlingTest.test_throttling.cloud_storage_type=CloudStorageType.ABS"

new failures in https://buildkite.com/redpanda/redpanda/builds/49219#018f8150-c106-4580-a04d-a434139cc1ed:

"rptest.tests.partition_balancer_test.PartitionBalancerTest.test_unavailable_nodes"

@pgellert pgellert force-pushed the quotas/fix-throttling-delay-enforcement branch from ebb9af3 to d8dbe9e Compare May 14, 2024 13:36
pgellert commented

Force-pushed to rebase to dev to try to fix the flaky tests.

src/v/kafka/server/connection_context.h (review thread, outdated, resolved)
src/v/kafka/server/connection_context.cc (review thread, outdated, resolved)
While exempt clients are already excluded from having their traffic
recorded, they are still throttled if the token bucket is in a
negative state because of other clients. To avoid this, exempt clients
should get a zero throttling delay so that they are excluded from
throttling altogether.
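
A minimal sketch of that exemption logic (hypothetical names and a simplified token bucket; the real logic lives in snc_quota_manager.cc):

```cpp
#include <chrono>

using std::chrono::milliseconds;

struct token_bucket {
    double tokens; // may go negative when other clients overshoot
    double rate;   // tokens replenished per second
};

milliseconds throttle_delay(const token_bucket& tb, bool exempt) {
    if (exempt) {
        return milliseconds{0}; // exempt clients are never throttled
    }
    if (tb.tokens >= 0) {
        return milliseconds{0};
    }
    // Delay until the deficit is replenished.
    return milliseconds{static_cast<long long>(-tb.tokens / tb.rate * 1000.0)};
}

int main() {
    token_bucket shared{-500.0, 1000.0}; // 0.5s deficit caused by other clients
    auto d1 = throttle_delay(shared, /*exempt=*/false); // ~500ms
    auto d2 = throttle_delay(shared, /*exempt=*/true);  // 0ms: the fix above
    (void)d1;
    (void)d2;
}
```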
@pgellert pgellert force-pushed the quotas/fix-throttling-delay-enforcement branch from d8dbe9e to 35ebcfe Compare May 16, 2024 10:49
pgellert commented

Force-pushed with changes to track the throttling state separately for fetch, produce and snc quotas (see the sketch below). Now that I added a test with produce clients, I discovered that if we track throttled_until as a single field, we would throttle even metadata or fetch requests when only the produce quota is reached. That's not incorrect, because clients should enforce the delay in the response immediately regardless of the type of request, but it might be unexpected for customers to see requests that are not configured to be throttled being throttled.

Also, I moved the commit with the fix for snc quota exemption ahead.
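
A sketch of the separated tracking described above (hypothetical names; the real state lives on the connection context). With one deadline per quota type, a pending produce throttle no longer delays unrelated metadata or fetch requests:

```cpp
#include <algorithm>
#include <chrono>

using sclock = std::chrono::steady_clock;

enum class quota_type { produce, fetch, snc };

struct throttle_state {
    // One deadline per quota type instead of a single shared field.
    sclock::time_point until[3]{};

    void record(quota_type t, sclock::duration delay) {
        auto& u = until[static_cast<int>(t)];
        u = std::max(u, sclock::now() + delay);
    }

    // Only the quota type matching the request is enforced.
    sclock::duration pending(quota_type t) const {
        auto u = until[static_cast<int>(t)];
        auto now = sclock::now();
        return now < u ? u - now : sclock::duration{0};
    }
};

int main() {
    throttle_state st;
    st.record(quota_type::produce, std::chrono::milliseconds{100});
    auto p = st.pending(quota_type::produce); // ~100ms still owed
    auto f = st.pending(quota_type::fetch);   // zero: fetch is unaffected
    (void)p;
    (void)f;
}
```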

@pgellert pgellert requested a review from BenPope May 16, 2024 10:56
src/v/kafka/server/connection_context.cc (review thread, outdated, resolved)
src/v/kafka/server/connection_context.cc (review thread, outdated, resolved)
src/v/kafka/server/connection_context.h (review thread, resolved)
@pgellert pgellert force-pushed the quotas/fix-throttling-delay-enforcement branch from 35ebcfe to ff96288 Compare May 16, 2024 17:35
pgellert commented

Force-pushed to address some more code review feedback.

@pgellert pgellert requested a review from BenPope May 16, 2024 17:37
BenPope previously approved these changes May 17, 2024

BenPope commented May 17, 2024

Are there tests for these?

  1. If a connection is throttled for say, produce, can it still make other requests without throttle? (request and enforce)
  2. The fix for the delay being recorded on the connection rather than the context (which would have resulted in incorrect throttling)


BenPope commented May 17, 2024

Can you run a few rounds of ducktape for tests/rptest/tests/cluster_quota_test.py, to make sure they're all happy.

@pgellert pgellert dismissed stale reviews from michael-redpanda and BenPope via b540b74 May 17, 2024 13:14
@pgellert pgellert force-pushed the quotas/fix-throttling-delay-enforcement branch from ff96288 to b540b74 Compare May 17, 2024 13:14
pgellert commented

@BenPope

> If a connection is throttled for say, produce, can it still make other requests without throttle? (request and enforce)

I've force-pushed an explicit test for this now.

> The fix for the delay being recorded on the connection rather than the context (which would have resulted in incorrect throttling)

This was already in place as of the last commit.

> Can you run a few rounds of ducktape for tests/rptest/tests/cluster_quota_test.py, to make sure they're all happy.

Locally I've run these with --repeat 10 and they were all happy.

@pgellert pgellert requested a review from BenPope May 17, 2024 13:16
@pgellert pgellert merged commit 77d54fc into redpanda-data:dev May 17, 2024
20 checks passed
@pgellert pgellert deleted the quotas/fix-throttling-delay-enforcement branch May 17, 2024 17:17
vbotbuildovich commented

/backport v24.1.x

vbotbuildovich commented

/backport v23.3.x

vbotbuildovich commented

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-18218-v23.3.x-550 remotes/upstream/v23.3.x
git cherry-pick -x cea3a2947363f3fdad1f511ca0e92f7101d6b0b6 aec585d62263e8528acf38cfd9b789ae63bb3ad6 286aaa7bae996bdc4aeab7df35a60f519ed2f27f fd18208466fc4ff3b68fe493cb20cf3b48bf1ef5 b540b748305ee050ac065043e36e580f73ca3a0c

Workflow run logs.
