
Throttle fetch requests on response size quota #7658

Merged: 4 commits, Jan 12, 2023

Conversation

@ZeDRoman (Contributor) commented Dec 8, 2022

Don't apply request-size throttling to fetch requests.
Fetch requests themselves are small, so there is no need to throttle them on request size.

Instead, fetch requests need to be throttled on response size. This PR creates a separate quota for response sizes, and that quota is used for fetch requests.

We cannot know a response's size before the response has been computed, so the current response-size rate is only available once the response is ready. Delaying a response after all the work has already been done makes no sense. Instead, we record the response size in the quota manager and use that history to delay the next fetch request that arrives at the node.
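The record-then-delay scheme described above can be sketched as a small model (Python for illustration only; the class and method names here are hypothetical, not the actual Redpanda C++ API):

```python
import time

class FetchResponseQuota:
    """Toy model of the record-then-delay scheme: response bytes are
    recorded after the response has been built (never delaying that
    response), and the *next* fetch is delayed if the observed byte
    rate exceeds the target."""

    def __init__(self, target_bytes_per_sec, window_sec=1.0):
        self.target = target_bytes_per_sec
        self.window = window_sec
        self.recorded = 0                     # bytes seen in current window
        self.window_start = time.monotonic()

    def record_response(self, size_bytes):
        # Called after the response is computed; only bookkeeping happens here.
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.recorded = 0
            self.window_start = now
        self.recorded += size_bytes

    def throttle_delay(self):
        # Called when the next fetch arrives: returns how long it should
        # wait so that the average rate falls back to the target.
        elapsed = max(time.monotonic() - self.window_start, 1e-9)
        if self.recorded / elapsed <= self.target:
            return 0.0
        return self.recorded / self.target - elapsed
```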

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

Release Notes

Features

  • Added a fetch throttling mechanism: target_fetch_quota_byte_rate sets the target fetch size quota byte rate per client (bytes per second). Disabled by default.
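Assuming the property behaves like other Redpanda cluster properties, enabling it might look like this (the value, roughly 10 MB/s, is just an example):

```shell
# Hypothetical usage: cap fetch responses at about 10 MB/s per client_id.
# The property is unset (disabled) by default.
rpk cluster config set target_fetch_quota_byte_rate 10485760
```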

@ZeDRoman ZeDRoman force-pushed the throttle-response-size branch 3 times, most recently from 441e2e2 to a4bb7e3, December 8, 2022 16:17
@ZeDRoman ZeDRoman requested a review from dlex December 8, 2022 20:59
@ZeDRoman ZeDRoman force-pushed the throttle-response-size branch 5 times, most recently from 476d57f to 2ac5095, December 15, 2022 08:40
@ZeDRoman ZeDRoman changed the title Throttle on response size quota Throttle fetch requests on response size quota Dec 15, 2022
@ZeDRoman ZeDRoman marked this pull request as ready for review December 15, 2022 15:13
@@ -29,6 +29,7 @@
#include <absl/container/flat_hash_map.h>

#include <memory>
#include <string_view>
Contributor

string_view is not used in this header

src/v/kafka/server/connection_context.cc (review thread resolved)
@@ -302,6 +309,7 @@ class connection_context final
const bool _enable_authorizer;
ctx_log _authlog;
std::optional<security::tls::mtls_state> _mtls_state;
request_data _request_data;
Contributor

(discuss) Might it be a better idea to save this data in session_resources instead? As far as I understand, session_resources is an object that is local to the handling of a request and strongly associated with it, whereas connection_context is long-lived, has the same lifetime as the connection, and is shared between all the requests from that connection.

Contributor Author

Agree

"target_fetch_quota_byte_rate",
"Target fetch size quota byte rate (bytes per second) - disabled default",
{.needs_restart = needs_restart::no, .visibility = visibility::user},
std::nullopt)
Contributor

(discuss) I think there is some value in making the response TP rate limiting symmetric with request TP limiting. The request TP limit defaults to 2GB/s, and when it was introduced it was not 100% compatible with previous behavior. However, 2GB/s is quite a high limit and I don't think many of our customers will ever hit it. Please let me know if you see any pros to making it different in this case.

Contributor Author

I think we can leave this disabled by default to preserve the previous behavior.
It is hard to know how much customers really consume; somebody may have multiple consumers with the same client_id.
If a customer wants this limit, they can configure it.

rate,
*_target_fetch_tp_rate(),
it->second.tp_fetch_rate.window_size());
}
Contributor

I would suggest supplying rate_tracker& and uint32_t target_fetch_rate as arguments to quota_manager::throttle(). Then the enum request_type won't be needed, and the code will be saved from extra branching and from the code duplication above.
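The suggested refactor could look roughly like this (a Python stand-in for the C++ signatures; the names follow the comment, but the delay formula is purely illustrative):

```python
class RateTracker:
    """Stand-in for the C++ rate_tracker interface assumed here."""
    def __init__(self, rate):
        self._rate = rate

    def measure(self):
        return self._rate

def throttle(tracker, target_rate):
    # The caller picks which (tracker, target) pair to pass in, so no
    # request_type enum and no per-type branching is needed inside.
    rate = tracker.measure()
    if rate <= target_rate:
        return 0.0
    # delay (seconds) needed for the average to fall back to the target
    return (rate - target_rate) / target_rate
```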

src/v/kafka/server/quota_manager.cc (review thread resolved, outdated)
src/v/resource_mgmt/rate.h (review thread resolved)
src/v/kafka/server/quota_manager.cc (review thread resolved, outdated)
consumer.poll(timeout_ms=1000, max_records=1)
assert consumer.metrics(
)["consumer-fetch-manager-metrics"]["fetch-throttle-time-max"] > 0

Contributor

Besides verifying just that throttling occurs, it would be great to verify that the configured limit is actually honoured. For example, consume 100MB at a 10MB/s rate and make sure it does not complete in less than 10s.
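A check along those lines might be structured like this (a framework-independent sketch; consume_fn stands in for a single consumer poll and returns the number of bytes it received):

```python
import time

def assert_rate_limited(consume_fn, total_bytes, rate_bytes_per_sec):
    """Consume total_bytes and verify the transfer took at least
    total/rate seconds, i.e. the configured quota was honoured."""
    start = time.monotonic()
    consumed = 0
    while consumed < total_bytes:
        consumed += consume_fn()
    elapsed = time.monotonic() - start
    min_expected = total_bytes / rate_bytes_per_sec
    assert elapsed >= min_expected, (
        f"{total_bytes} bytes in {elapsed:.2f}s; at "
        f"{rate_bytes_per_sec} B/s it should take >= {min_expected:.2f}s")
```

For the example in the comment, that would be assert_rate_limited(poll_once, 100 * 2**20, 10 * 2**20).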

tests/rptest/tests/cluster_quota_test.py (review thread resolved, outdated)
@ZeDRoman ZeDRoman requested a review from dlex December 16, 2022 20:29
@ZeDRoman ZeDRoman force-pushed the throttle-response-size branch 4 times, most recently from 700eec8 to eb37c79, December 30, 2022 11:52
@emaxerrno (Contributor)

Question: per session (TCP), couldn't the source core keep a float32 rate of fetch bytes and predictively slow down the fetch request size by mutating the request object itself for some time period? Say, it resets every 3 seconds if no fetch is received.

@ZeDRoman (Contributor Author) commented Jan 6, 2023

Question: per session (TCP), couldn't the source core keep a float32 rate of fetch bytes and predictively slow down the fetch request size by mutating the request object itself for some time period? Say, it resets every 3 seconds if no fetch is received.

Do you mean something like predicting the next fetch size from previous results?
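For what it's worth, the idea in the question could be modeled like this (entirely hypothetical; the class name, smoothing constants, and reset period are made up for illustration):

```python
class PredictiveFetchLimiter:
    """Per-connection sketch: track an exponentially weighted rate of
    fetch bytes and shrink max_bytes on the next request when the rate
    is over target. The estimate resets after an idle period, as the
    question suggests."""

    def __init__(self, target_bytes_per_sec, reset_after=3.0):
        self.target = target_bytes_per_sec
        self.reset_after = reset_after
        self.ewma_rate = 0.0
        self.last_fetch = None

    def on_fetch(self, now, requested_max_bytes, last_response_bytes):
        if self.last_fetch is None or now - self.last_fetch > self.reset_after:
            self.ewma_rate = 0.0          # reset after an idle period
        else:
            dt = max(now - self.last_fetch, 1e-9)
            sample = last_response_bytes / dt
            self.ewma_rate = 0.8 * self.ewma_rate + 0.2 * sample
        self.last_fetch = now
        if self.ewma_rate > self.target:
            # mutate the request (shrink its size) instead of delaying it
            return max(1, int(requested_max_bytes * self.target / self.ewma_rate))
        return requested_max_bytes
```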

@dlex (Contributor) previously approved these changes Jan 7, 2023

LGTM

== fetch_api::key) {
_server.quota_mgr().record_fetch_tp(
resp_and_res.resources->request_data.client_id, msg.size());
}
Contributor

nit: I think the try...catch here is only intended to mask exceptions thrown in conn->write(), so I would rather place this call before the try.
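In Python terms, the suggested move looks like this (all names here are stand-ins for the C++ code under review):

```python
FETCH_API_KEY = 1  # assumed value, for illustration only

class QuotaMgr:
    def __init__(self):
        self.recorded = []

    def record_fetch_tp(self, client_id, size):
        self.recorded.append((client_id, size))

def process_response(quota_mgr, request_key, client_id, msg, conn):
    # Record the fetch response size before entering the try block, so
    # the except clause only masks failures from conn.write(), as intended.
    if request_key == FETCH_API_KEY:
        quota_mgr.record_fetch_tp(client_id, len(msg))
    try:
        conn.write(msg)
    except ConnectionError:
        pass  # only write() failures are swallowed
```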

Contributor Author

fixed

@ZeDRoman (Contributor Author) commented Jan 9, 2023

/ci-repeat 10
skip-units
dt-repeat=100
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest

@ZeDRoman (Contributor Author)

/ci-repeat 10
skip-units
dt-repeat=100
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest

@ZeDRoman (Contributor Author)

/ci-repeat 10
skip-units
dt-repeat=100
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest

@ZeDRoman (Contributor Author)

/ci-repeat 10
skip-units
dt-repeat=100
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest

@ZeDRoman ZeDRoman force-pushed the throttle-response-size branch 3 times, most recently from bbb0f75 to 0d2c37c, January 11, 2023 20:44
@ZeDRoman (Contributor Author)

/ci-repeat 10
skip-units
dt-repeat=100
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest

@ZeDRoman (Contributor Author)

/ci-repeat 10
skip-units
dt-repeat=100
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest

Do not apply request throttling to fetch requests.
Fetch requests are now throttled by response size.
If a fetch request breaks the quota limit, the next request will be
throttled according to the previous rate.
@ZeDRoman (Contributor Author)

/ci-repeat 10
skip-units
dt-repeat=100
tests/rptest/tests/cluster_quota_test.py::ClusterRateQuotaTest

@dotnwat (Member) left a comment

is this going to have conflicts with @dlex's PR that adds a new control at the kafka layer? we should make sure the sequence of merging makes sense.

@ZeDRoman ZeDRoman requested a review from dlex January 12, 2023 16:12
@dlex (Contributor) commented Jan 12, 2023

is this going to have conflicts

np, I will resolve them

@ZeDRoman ZeDRoman merged commit a903a64 into redpanda-data:dev Jan 12, 2023
@dlex dlex mentioned this pull request Jan 12, 2023
6 tasks