Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage_clients: stricter but simpler invariants for the pool #15681

Merged
merged 6 commits into from
Dec 19, 2023

Conversation

nvartolomei
Copy link
Contributor

@nvartolomei nvartolomei commented Dec 15, 2023

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Bug Fixes

  • Fix a rare bug where http client connections would vanish from the connection pool leading to various operations hanging while waiting for an http client.

@nvartolomei
Copy link
Contributor Author

/dt

@nvartolomei
Copy link
Contributor Author

/dt

@nvartolomei
Copy link
Contributor Author

/cdt

@nvartolomei nvartolomei force-pushed the nv/client-pool-metastable branch 2 times, most recently from 2025be0 to 3b2479e Compare December 16, 2023 20:22
@nvartolomei nvartolomei changed the title cloud_storage_clients: stricter invariants for the pool cloud_storage_clients: stricter but simpler invariants for the pool Dec 16, 2023
@nvartolomei nvartolomei marked this pull request as ready for review December 16, 2023 20:24
@nvartolomei
Copy link
Contributor Author

/dt

@vbotbuildovich
Copy link
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/42948#018c7740-e6c2-40d6-91d1-d17fedf2ace6:

"rptest.tests.e2e_iam_role_test.AWSRoleFetchTests.test_write"

@nvartolomei
Copy link
Contributor Author

/dt

can be used for synchronization in testing
Simplify the lease/borrow logic by never adding borrowed clients to the
local pool.
@piyushredpanda piyushredpanda added this to the v23.3.1-rc5 milestone Dec 19, 2023
@nvartolomei
Copy link
Contributor Author

nvartolomei commented Dec 19, 2023

  • Force-push with dropping the unnecessary condition variable wait change and improvements for one of the tests.

@@ -442,7 +442,7 @@ void client_pool::populate_client_pool() {
_cvar.signal();
}

client_pool::http_client_ptr client_pool::make_client() const {
client_pool::http_client_ptr client_pool::make_client() const noexcept {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: this work with us because an allocation failure gets handled with an abort and a stacktrace directly by the allocator, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s my assumption. The code using this method doesn’t seem to be exception safe so I made make_client fail hard rather than cause surprising behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it's not exception safe? it doesn't change any state in client_pool
it's OK for this method to be noexcept, I'm just curious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One example is when we’re borrowing from another shard. If make_client throws an exception there, then we would never return the borrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, so it's the code in a different place is not exception safe if this method could throw

Comment on lines +485 to +488
vassert(
_pool.size() < _capacity,
"tried to release a client but the pool is at capacity");
_pool.emplace_back(std::move(leased));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a pretty big change of behavior, is this why the test was hanging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is related to this, yes. The change in behavior doesn’t have a material effect. Used ManyPartitionsTest with tiered storage to confirm that.

src/v/cloud_storage_clients/client_pool.cc Show resolved Hide resolved
@andijcr
Copy link
Contributor

andijcr commented Dec 19, 2023

seems good, but should we wait for the LRC to have a chance to test this and in case backport on 23.3?

@nvartolomei
Copy link
Contributor Author

@andijcr can you elaborate on the LRC bit? My plan is to backport this to 23.3 and 23.2. You mean to delay backporting to 23.2?

@andijcr
Copy link
Contributor

andijcr commented Dec 19, 2023

@andijcr can you elaborate on the LRC bit? My plan is to backport this to 23.3 and 23.2. You mean to delay backporting to 23.2?

I meant delay the backport to 23.3 to have a chance to test that the new vassert in release_one is not triggered, if this pr is not a fix for an incident. not sure if we plan to run LRC with dev, was just a suggestion

// A gate for background operations. Most useful in testing where we want
// to wait all async housekeeping to complete before asserting state
// invariants.
ss::gate _bg_gate;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it useful for anything other than tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only for testing.

@@ -442,7 +442,7 @@ void client_pool::populate_client_pool() {
_cvar.signal();
}

client_pool::http_client_ptr client_pool::make_client() const {
client_pool::http_client_ptr client_pool::make_client() const noexcept {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why it's not exception safe? it doesn't change any state in client_pool
it's OK for this method to be noexcept, I'm just curious

@nvartolomei nvartolomei merged commit 037380b into redpanda-data:dev Dec 19, 2023
19 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.2.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-15681-v23.2.x-515 remotes/upstream/v23.2.x
git cherry-pick -x 286004247fd07bd2060c441e52a6ef257d96267c 8f68873e774dbb88a359d3ca561748892827fc00 a2141244a682c329a2e7d1c8a92be09bad48bd83 e2a762dd2c4db34d2ea18fad88d7ef82b8db5cb3 e16fe06f533f3ff210b1f778d4b7656236254ad7 a6111c0d45751422a301100c906185292ad74d90

Workflow run logs.

Comment on lines +274 to +275
co_await ssx::with_timeout_abortable(
_cvar.wait(), model::no_timeout, as);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvartolomei i don't think we should do this. when the abort source is delivered the waiting future is backgrounded. that's hard to analyze and maintain.

it looks like it might be safe in this case, but i'm still not 100% sure. in stop() when sem::broken() is called all of the waiter promises have their values set. but the continuation returned from wait() captures a reference to the condition variable and broken() doesn't wait on anything. what if seastar changed and inserted a continuation in that chain that captured a reference to _cvar to do something like implement some debugging feature? that'd be a potential use after free.

at the very least it looks like the _cvar.wait() future should be wrapped in a gate.

you could also use the sleeping variant of wait() and wake up periodically to poll the abort source. not as efficient, but it is clear and simple.

you could also do something like

as.subscribe([this] { _cvar.signal(); });
_cvar.wait();
if (_abort_source.abort_requested()) { ... }

it's also not entirely clear why the abort source is needed. we already have a condition variable. whatever is requesting the abort, could just broadcast on the condition variable and then each fiber waiting on the cvar can figure out the state of the system and what to do.

there are probably other formulations too, but let's reserve future backgrounding for when it's really necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the future returned by cvar.wait() can safely outlive the cvar. So it shouldn't be a problem to background it.

Signaling _cvar from as.subscribe callback looks safe though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are onto something. We'd have to _cvar.broadcast() instead of signal as we need to wake up a particular waiter. Since we don't know which one, we have to wake them all. I don't like waking up all waiters.

Re backgrounding: I'm pretty sure it is safe to do. In the worst case we'd end up with a broken_promise exception from there. Agree that it is not desirable. Couldn't come up with anything better at this moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall condition_variable::wait have a method which accepts an abort_source?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the future returned by cvar.wait() can safely outlive the cvar

maybe, but we should not have background futures that are not protected by a gate. is there an exception to this rule?

Copy link
Member

@dotnwat dotnwat Dec 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't come up with anything better at this moment.

let's background it under a gate, and think about a fix. we really shouldn't have backgrounded futures for cases like this. we can maybe integrate something into upstream seastar--i think that the continued safety of this approach is dependent on seastar cv implementation co-design--and that's a risk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

polling is a common thing we do (using the timeout variant of wait). it is very light weight and makes any concern of background and lifetime completely vanish.

@@ -18,6 +18,7 @@
#include <seastar/core/smp.hh>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify the lease/borrow logic by never adding borrowed clients to the
local pool.

it would be nice if this commit message explained why things are simpler. the changes in the commit are complicated enough that it's not obvious that it is now some how simpler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I have mentioned elsewhere, noted to myself to write less abstract commit messages in the future. Thank you for feedback! :thanks:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants