Improve throughput throttling #16441
Conversation
Force-pushed 9be68fd to c5327d4
Force-pushed c5327d4 to 22e66c1
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45208#018dcdda-5cc1-494a-a393-3311d137294a
well... this is dope! @c4milo for uptaking this into cloud. cc: @timyim-rp
What's the difference between "v2 off" and "nightly (no limit)"? I would expect latency to be fine with "v2 off". It seems like (v1) throttling was still applied there. Further, "v2 on" seems quite a bit worse than "nightly (no limit)". How stable were those numbers? Usually the later runs are quite a bit worse, since the topics stay around and aren't deleted manually.
I think "Nightly (no limit)" is with the limits set to
Yes, "v2 off" was with limits applied, with v1 throttling, so it should be compared to "Nightly (limits)".
I didn't run it lots of times, just once with the default 10mins or so. Runs were sequential with restarts in between, and no topic cleanup. Actually the order of the runs was: "Nightly (no limits)", "Nightly (limits)", "v2 on", "v2 off"; I shuffled the table for readability. I was thinking it's kind of in the noise; I can tell you that no requests or responses were throttled with v2 on.
if (_use_throttling_v2()) {
    if (_node_quota.in) {
        _node_quota.in->replenish(now);
        _node_quota.in->grab(request_size);
So given how the rovers in the shared_token_bucket work and given that we unconditionally grab from the bucket I think we can end up in a situation where:
- A large burst causes tail >> head
- Incoming rate then equals replenish rate / limit
- All requests continue to get dropped as tail >= head is a continuous state
Am I missing something? Is that the intended behaviour?
I guess assuming everyone fully respects the throttle time it will work fine.
Yeah, if a client doesn't respect the throttle time, it's enforced (on the second read), so as long as the set of outstanding requests doesn't overflow a uint64_t, it should all work.
/// reset() must be called on the home shard.
/// The pointer is stable until the fiber yields.
template<typename T>
class sharded_ptr {
Can you explain this struct a bit more? I don't fully understand the use case over a plain pointer.
The shared_token_bucket is const, due to a const limit (this probably makes the atomic operations manageable by limiting the number of atomic variables). So I hold it by pointer. When the throughput limit is updated, I need to reseat that pointer with a new bucket that has a new limit. That must be done on the home core, but even with an atomic pointer, I can't guarantee that the pointed-to bucket is stable between scheduling points.
We make multiple accesses to the token bucket, e.g.:
tb.duration_for(tb.deficiency(tb.grab(0)))
So I need the pointer to be stable at least between scheduling points. That's what sharded_ptr is for; it keeps the original alive, updates the local copy on other shards, and then takes ownership of the new pointer and destroys the old one.
so this is only applicable when reset is called? Since this is the only time you'd need to update the pointer? I'd assume this would be done if someone modifies a configuration variable to update the threshold?
Exactly.
It would be nice to see a bit more detail on how it is to be used safely. For example, unlike most sharded services, some methods are safe to call concurrently from other shards even before start() (which happens implicitly) is called, right?
How about stop(): what are the semantics of concurrent access on other shards while stop() is called on the owner? Concurrent stop()s on the owner? Etc.
If possible we should try to enforce those rules, as you've done with the owning shard assert, since these are things that are easy to mis-use and then they seem to work until one day ... boom.
AFAICT std::shared_ptr<T> instances never escape the interface (e.g. dereference operators proxy to the underlying pointer), so I'm not entirely sure why _state is a vector of std::shared_ptr<T> instances. That is, we'd pay an atomic increment cost to copy the instances into _state, but further accesses via the dereference operators aren't bumping the reference count. Am I missing something here?
It made the implementation a lot cleaner.
But the intention was to allow a shared_ptr or weak_ptr to be held, although it's currently a bit YAGNI. Would making local public suffice, or would another name be more appropriate? operator() is available.
Would making local public suffice, or would another name be more appropriate? operator() is available.
suffice for what?
A name that can be used to bind a shared or weak pointer:
sharded_ptr<int> sp;
co_await sp.reset(std::make_unique<int>(42));
std::shared_ptr<int> shared = sp.local();
I don't have a preference. I was only making the observation that if the shared pointer is not escaping the interface, then it didn't seem to serve a purpose to have one per core.
@dotnwat - you need the vector of std::shared_ptr (or some other approach) for thread-safety, even if they don't escape the interface. With a single pointer, shard 1 can be accessing the shared pointer while shard 0 (the owning shard) is resetting it: this is not safe (e.g., the object could be destroyed after the pointer was read but before subsequent use, a UAF).
So the vector of shared pointers is to allow each shard to own a copy and update it in the usual message passing way.
Force-pushed 22e66c1 to 600f598
Changes in force-push
/cdt
src/v/ssx/sharded_ptr.h
Outdated
co_await _state
  .invoke_on_others([up](auto& state) noexcept { state = up; })
  .finally([this, u{std::move(u)}]() mutable noexcept {
      std::default_delete<T>{}(_state.local());
what effect does this have? is it the same as doing _state.local().release()?
_state.local() is a raw pointer, so it's basically delete _state.local(); I'm using default_delete since that's what unique_ptr uses.
I believe local() is just the raw pointer type T*. The u.release() returns the raw pointer held by the std::unique_ptr and tells std::unique_ptr to not clean up the memory when destructed (ownership transfer).
All changed
src/v/ssx/sharded_ptr.h
Outdated
}
co_return co_await _state.start(up).finally(
  [this, u{std::move(u)}]() mutable noexcept {
      if (u.get() == _state.local()) {
when would this not be the case?
In case start fails for some reason, possibly a failed allocation or something like that.
Wouldn't you want a similar condition in the finally for the other case below (where we don't do the init)?
Overall this might be simpler to reason about if this init path just null-init'd the pointers on all the other shards, and then fell through to the update logic below: fewer cases to think about. It's only a one-time inefficiency (per sharded_ptr).
Changed: no raw pointers or funky memory management.
src/v/ssx/sharded_ptr.h
Outdated
    return _state.local_is_initialized() && _state.local() != nullptr;
}

ss::future<> reset(std::unique_ptr<T> u = nullptr) {
would we want a mutex here? I could see strange things possibly occurring if this method is called again when a previous invocation hasn't completely finished executing yet
The handling of the pointers looks safe to me, but there is a potential race between _state.local_is_initialized() and _state.start(). Maybe I'll split out a start method.
Added a mutex and rewrote it.
Looks awesome :)
      "reset must be called on home shard: ",
      _shard);
}
ss::shard_id _shard;
question: would it be helpful to have a getter for this?
I'll add one.
Done
            void(u.release());
        }
    });
}
Overall I think this method has several races, but I guess it depends on the exact semantics of the sharded_ptr.
E.g., I would expect that when we are not doing any further reset() calls on the pointer, all shards eventually converge on the same value for the pointer, right? I don't think that's going to be the case now, since when the owning shard does some concurrent resets, different shards may get a different final value depending on what the last writer was on that shard. This re-ordering could happen on the owning shard if it suspends part way through start() or invoke_on_others, in the shard<->shard queues (if reordering is allowed there), or on the other shards since task execution order is not guaranteed.
I think there are other races along these lines (all related to concurrent resets).
You could do something fancy with a generation counter and ignore old updates, but a big fat mutex around reset should solve it? I think the mutex could also replace the gate: just take the mutex also in stop.
Yeah, mutex works for me:
ss::future<> reset(std::unique_ptr<T> u = nullptr) {
    assert_shard();
    auto mu{co_await _mutex.get_units()};
    auto up{u.get()};
    std::unique_ptr<T> l{
      !_state.local_is_initialized() ? nullptr : _state.local()};
    co_await (
      !_state.local_is_initialized()
        ? _state.start(up)
        : _state.invoke_on_all([up](T*& state) noexcept { state = up; }))
      .finally([this, l{std::move(l)}, u{std::move(u)}]() mutable noexcept {
          if (_state.local() == u.get()) {
              void(u.release());
          } else if (_state.local() == l.get()) {
              void(l.release());
          }
      });
}
ss::future<> stop() {
    co_await reset();
    co_await _mutex.with([this] { _mutex.broken(); });
    co_await _state.stop();
}
Force-pushed 600f598 to 06334ae
Changes in force-push
haven't made it through the entire PR, but left two questions on the sharded_ptr commit.
src/v/ssx/sharded_ptr.h
Outdated
sharded_ptr(sharded_ptr&& other) noexcept = delete;
sharded_ptr(sharded_ptr const&) noexcept = delete;
sharded_ptr& operator=(sharded_ptr const&) = delete;
sharded_ptr& operator=(sharded_ptr&&) = delete;
did you intend to make the move assignment noexcept (and not the copy constructor)?
did you intend to make the move assignment noexcept (and not the copy constructor)?
That was not intentional. Do you have an opinion on allowing copy and move instead? I'm tempted to leave them defaulted, or at least leave move defaulted.
Made it movable and fixed up the destructor and noexcept.
Looks good to me now, minus the reference comment.
I guess you have decided against adding an interface to sharded_ptr which returns a copy of the local std::shared_ptr, given we don't really need it yet?
Force-pushed 06334ae to af9317b
Historical data for this test has shown the p75 end-to-end latency to always be around 5 to 6ms. This commit changes the validator to allow for p75 latency of 6ms or less to reduce the number of false positives in the nightly regression tests.
Signed-off-by: Ben Pope <ben@redpanda.com>
Refactor only, no functional change.
Signed-off-by: Ben Pope <ben@redpanda.com>
New metrics:
* `traffic_egress` - mirrors `traffic_intake`
* `throttle_time` - histogram of throttle time requested
Signed-off-by: Ben Pope <ben@redpanda.com>
Signed-off-by: Ben Pope <ben@redpanda.com>
When `kafka_throughput_throttling_v2` is enabled (true by default), use an `ss::internal::shared_token_bucket` for requests across all shards, instead of balancing quota between shards.
Signed-off-by: Ben Pope <ben@redpanda.com>
af9317b
to
6ddfb32
Compare
@parametrize(driver_idx="ACK_ALL_GROUP_LINGER_1MS_IDEM_MAX_IN_FLIGHT",
             workload_idx="RELEASE_CERT_SMOKE_LOAD_625k")
def test_perf(self, driver_idx, workload_idx):
@parametrize(driver_idx="ACK_ALL_GROUP_LINGER_1MS_IDEM_MAX_IN_FLIGHT")
Something went wrong rebasing here I assume? These seem like unrelated changes.
Yep, should have used --keep-base because I can't count to 6.
Changes in force-pushes
🔥 lgtm
/backport v23.3.x
Failed to create a backport PR to v23.3.x branch. I tried:
v2 was enabled by default in: redpanda-data#16441 So running the balancer doesn't do anything other than waste cpu cycles. Signed-off-by: Ben Pope <ben@redpanda.com>
v2 was enabled by default in: redpanda-data#16441 Running the balancer doesn't do anything other than waste cpu cycles. Signed-off-by: Ben Pope <ben@redpanda.com>
v2 was enabled by default in: redpanda-data#16441 Running the balancer doesn't do anything other than waste cpu cycles. Signed-off-by: Ben Pope <ben@redpanda.com> (cherry picked from commit e0ba45e)
When kafka_throughput_throttling_v2 is enabled (true by default), use an ss::internal::shared_token_bucket for requests across all shards, instead of balancing quota between shards.
Fixes https://github.com/redpanda-data/core-internal/issues/929
Fixes https://github.com/redpanda-data/core-internal/issues/1021
Fixes https://github.com/redpanda-data/core-internal/issues/1022
Fixes https://github.com/redpanda-data/core-internal/issues/1023
Maybe fixes: #8809
Cluster configuration (probably temporary):
- kafka_throughput_throttling_v2: use a shared token bucket instead of balancing quota between shards. Default: true
- kafka_throughput_replenish_threshold: will be clamped between 1 and rate. Default: 1
Metrics (prefix vectorized_kafka_quotas_):
- balancer_runs
- quota_effective
- traffic_intake: taken into processing, in bytes
- traffic_egress: taken into processing, in bytes
- throttle_time
Testing: Tier 5, aggregated pub latency (ms, avg); results table not captured.
Artefacts
There's a (local) build on DockerHub
See also: https://github.com/redpanda-data/cloudv2/pull/12488
Backports Required
Release Notes
Improvements