cloud_storage_clients: stricter but simpler invariants for the pool #15681
can be used for synchronization in testing
Simplify the lease/borrow logic by never adding borrowed clients to the local pool.
@@ -442,7 +442,7 @@ void client_pool::populate_client_pool() {
     _cvar.signal();
 }

-client_pool::http_client_ptr client_pool::make_client() const {
+client_pool::http_client_ptr client_pool::make_client() const noexcept {
Q: this works for us because an allocation failure is handled directly by the allocator, with an abort and a stack trace, right?
That’s my assumption. The code using this method doesn’t seem to be exception safe so I made make_client fail hard rather than cause surprising behaviour.
why is it not exception safe? it doesn't change any state in client_pool. it's OK for this method to be noexcept, I'm just curious
One example is when we’re borrowing from another shard. If make_client throws an exception there, then we would never return the borrow.
got it, so it's the code in a different place that would not be exception safe if this method could throw
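To make the exception-safety concern in this thread concrete, here is a minimal standard-C++ sketch. It is not the client_pool code: `shard_state`, `borrow_guard`, `make_client_may_throw`, `borrow_unguarded` and `borrow_guarded` are hypothetical names invented for illustration. It shows how a throwing factory can leave a borrow unreturned, and how an RAII guard would avoid that. The PR takes the other route: marking make_client as noexcept means an escaping exception calls std::terminate instead of leaving the bookkeeping inconsistent.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for per-shard bookkeeping of lent-out clients.
struct shard_state {
    int lent = 0;
};

// Hypothetical stand-in for an HTTP client.
struct client {};

// Stand-in for a make_client() that is allowed to throw.
inline client make_client_may_throw(bool fail) {
    if (fail) {
        throw std::runtime_error("client construction failed");
    }
    return client{};
}

// RAII guard: records a borrow on construction and returns it on
// destruction unless commit() was called.
class borrow_guard {
public:
    explicit borrow_guard(shard_state& s) : _s(s) { ++_s.lent; }
    ~borrow_guard() {
        if (!_committed) {
            --_s.lent;
        }
    }
    borrow_guard(const borrow_guard&) = delete;
    borrow_guard& operator=(const borrow_guard&) = delete;
    void commit() noexcept { _committed = true; }

private:
    shard_state& _s;
    bool _committed = false;
};

// Unguarded: if the factory throws, the borrow is never returned.
inline client borrow_unguarded(shard_state& s, bool fail) {
    ++s.lent;
    return make_client_may_throw(fail);
}

// Guarded: the borrow is rolled back when the factory throws.
inline client borrow_guarded(shard_state& s, bool fail) {
    borrow_guard g(s);
    client c = make_client_may_throw(fail);
    g.commit(); // success: the caller now owns the borrow
    return c;
}
```

Whether a guard or a hard failure is preferable depends on the call sites; the author's stated preference here is to fail hard rather than risk surprising behaviour.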
vassert(
  _pool.size() < _capacity,
  "tried to release a client but the pool is at capacity");
_pool.emplace_back(std::move(leased));
that's a pretty big change of behavior, is this why the test was hanging?
It is related to this, yes. The change in behavior doesn't have a material effect; I confirmed that by running ManyPartitionsTest with tiered storage.
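As a rough illustration of the invariant the new vassert protects, here is a toy pool in standard C++. It is a sketch under assumptions, not the actual client_pool: `toy_pool`, `client`, `acquire`, `release` and `invariant_holds` are made-up names, and cross-shard borrowing is left out entirely. The assumption is that only clients leased from this pool are ever released back into it, so a release can never find the pool at capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an HTTP client.
struct client {
    int id;
};

// Toy pool: every client is either stored in _pool or leased out.
class toy_pool {
public:
    explicit toy_pool(std::size_t capacity)
      : _capacity(capacity) {
        for (std::size_t i = 0; i < capacity; ++i) {
            _pool.push_back(
              std::make_unique<client>(client{static_cast<int>(i)}));
        }
    }

    // Returns nullptr when no local client is available.
    std::unique_ptr<client> acquire() {
        if (_pool.empty()) {
            return nullptr;
        }
        auto c = std::move(_pool.back());
        _pool.pop_back();
        ++_leased;
        return c;
    }

    // Only clients leased from this pool come back, so there is room.
    void release(std::unique_ptr<client> leased) {
        assert(_pool.size() < _capacity && "pool is at capacity");
        _pool.push_back(std::move(leased));
        --_leased;
    }

    std::size_t available() const { return _pool.size(); }
    std::size_t leased() const { return _leased; }

    // Stored plus leased clients always add up to the capacity.
    bool invariant_holds() const {
        return _pool.size() + _leased == _capacity;
    }

private:
    std::size_t _capacity;
    std::vector<std::unique_ptr<client>> _pool;
    std::size_t _leased = 0;
};
```

In this model a client that came from elsewhere would have to be handed back to its owner rather than released locally; otherwise the assertion in `release` could fire.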
seems good, but should we wait for the LRC to have a chance to test this, and only then backport to 23.3?
@andijcr can you elaborate on the LRC bit? My plan is to backport this to 23.3 and 23.2. Do you mean to delay backporting to 23.2?
I meant delaying the backport to 23.3, so that we have a chance to test that the new vassert in release_one is not triggered, provided this PR is not a fix for an incident. not sure if we plan to run LRC with dev, it was just a suggestion
// A gate for background operations. Most useful in testing where we want
// to wait all async housekeeping to complete before asserting state
// invariants.
ss::gate _bg_gate;
is it useful for anything other than tests?
Only for testing.
/backport v23.3.x
/backport v23.2.x
Failed to create a backport PR to v23.2.x branch. I tried:
co_await ssx::with_timeout_abortable(
  _cvar.wait(), model::no_timeout, as);
@nvartolomei i don't think we should do this. when the abort request is delivered, the waiting future is backgrounded. that's hard to analyze and maintain.
it looks like it might be safe in this case, but i'm still not 100% sure. in stop(), when sem::broken() is called, all of the waiter promises have their values set. but the continuation returned from wait() captures a reference to the condition variable, and broken() doesn't wait on anything. what if seastar changed and inserted a continuation in that chain that captured a reference to _cvar, say to implement some debugging feature? that'd be a potential use-after-free.
at the very least, it looks like the _cvar.wait() future should be wrapped in a gate.
you could also use the sleeping variant of wait() and wake up periodically to poll the abort source. not as efficient, but it is clear and simple.
you could also do something like
// keep the subscription alive for as long as we wait
auto sub = as.subscribe([this]() noexcept { _cvar.signal(); });
co_await _cvar.wait();
if (as.abort_requested()) { ... }
it's also not entirely clear why the abort source is needed. we already have a condition variable. whatever is requesting the abort, could just broadcast on the condition variable and then each fiber waiting on the cvar can figure out the state of the system and what to do.
there are probably other formulations too, but let's reserve future backgrounding for when it's really necessary.
I think that the future returned by cvar.wait() can safely outlive the cvar. So it shouldn't be a problem to background it.
Signaling _cvar from the as.subscribe callback looks safe though.
I think you are onto something. We'd have to _cvar.broadcast() instead of signal, as we need to wake up a particular waiter. Since we don't know which one, we have to wake them all. I don't like waking up all waiters.
Re backgrounding: I'm pretty sure it is safe to do. In the worst case we'd end up with a broken_promise exception from there. Agree that it is not desirable. Couldn't come up with anything better at this moment.
Should condition_variable::wait have an overload which accepts an abort_source?
> I think that the future returned by cvar.wait() can safely outlive the cvar

maybe, but we should not have background futures that are not protected by a gate. is there an exception to this rule?
> Couldn't come up with anything better at this moment.

let's background it under a gate, and think about a fix. we really shouldn't have backgrounded futures for cases like this. we can maybe integrate something into upstream seastar. i think the continued safety of this approach depends on co-design with the seastar cv implementation, and that's a risk.
polling is a common thing we do (using the timeout variant of wait). it is very lightweight and makes any concern about backgrounding and lifetime completely vanish.
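The polling approach suggested here can be sketched with the C++ standard library. This is an analogy only, not Seastar code: `waiter_state` and `wait_ready_or_abort` are hypothetical names, `std::condition_variable::wait_for` stands in for the timeout variant of wait, and an atomic flag stands in for the abort source. The waiter never leaves a wait running in the background; it wakes up at most once per poll interval and checks the flag itself.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical shared state between a waiter and whoever signals or aborts.
struct waiter_state {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;             // the condition being waited for
    std::atomic<bool> abort{false}; // stand-in for an abort source
};

// Waits until `ready` is set or an abort is requested.
// Returns true when ready, false when aborted.
inline bool
wait_ready_or_abort(waiter_state& st, std::chrono::milliseconds poll) {
    std::unique_lock<std::mutex> lk(st.m);
    while (!st.ready) {
        if (st.abort.load()) {
            return false;
        }
        // Timed wait: returns on notify, on timeout, or spuriously.
        // In every case the loop re-checks both conditions.
        st.cv.wait_for(lk, poll);
    }
    return true;
}
```

The cost is a bounded delay of up to one poll interval before an abort is noticed, in exchange for having no future or callback that can outlive the condition variable.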
@@ -18,6 +18,7 @@
 #include <seastar/core/smp.hh>
> Simplify the lease/borrow logic by never adding borrowed clients to the
> local pool.

it would be nice if this commit message explained why things are simpler. the changes in the commit are complicated enough that it's not obvious that it is now somehow simpler.
As I have mentioned elsewhere, I've made a note to myself to write less abstract commit messages in the future. Thank you for the feedback!
thanks!
Backports Required
Release Notes
Bug Fixes