Cloud storage shutdown hang #8587
Conversation
Force-pushed from 30add60 to c7e8cd9.
```cpp
ss::maybe_yield().get();
ss::sleep(std::chrono::milliseconds(10)).get();
api.shutdown_connections();
g.close().get();
```
When this fails, does it manifest as a hang of the test? If so, that was fine as a short-term reproducer, but for committing it we should wrap this in a spin_wait_with_timeout or similar, so that the test fails cleanly.
Done
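For reference, a minimal sketch of the timeout-bounded shape, using Seastar's `ss::with_timeout` (the 30s deadline is illustrative, not from this PR; `api` and `g` are the remote API and gate from the test snippet quoted above):

```cpp
#include <chrono>
#include <seastar/core/lowres_clock.hh>
#include <seastar/core/with_timeout.hh>
namespace ss = seastar;

// Bound the gate close so a regression surfaces as ss::timed_out_error
// and fails the test cleanly, rather than hanging the binary forever.
api.shutdown_connections();
ss::with_timeout(
  ss::lowres_clock::now() + std::chrono::seconds(30), g.close())
  .get();
```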
The other fanout fix PR merged, so this can be rebased + drop the commit on that file.
src/v/storage/parser.cc
Outdated
```cpp
@@ -217,22 +217,22 @@ static ss::future<result<model::record_batch_header>> read_header_impl(

ss::future<result<model::record_batch_header>>
continuous_batch_parser::read_header() {
    return read_header_impl(get_stream(), *_consumer, _recovery);
    auto& st = get_stream();
```
nit: this one was arguably easier to read before: is this perhaps a place that you coroutinized to add some debug, but it's gone now?
Yep, this is leftover from some debugging. I'll actually move the coroutinizing into a separate PR, to keep this one focused on actual fixes.
src/v/cloud_storage/remote.cc
Outdated
```cpp
@@ -100,6 +102,7 @@ ss::future<> remote::stop() {
void remote::shutdown_connections() {
    cst_log.debug("Shutting down remote connections...");
    _pool.shutdown_connections();
    _as.request_abort();
```
I think this is fine, but it shouldn't be functionally necessary, right? If it was, then I would think something was wrong elsewhere.
If this is just to help with logging, then it's fine, but maybe add a comment to explain that.
Yeah, this was one of the first things I was suspicious of, but it ended up not being necessary. I'll remove it from the PR.
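For context, a minimal sketch of what `request_abort()` does in general Seastar terms (an illustrative background loop, not Redpanda's actual code):

```cpp
#include <chrono>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>
namespace ss = seastar;

ss::future<> background_loop(ss::abort_source& as) {
    while (!as.abort_requested()) {
        try {
            // Wakes immediately with ss::sleep_aborted once
            // request_abort() is called on this abort source.
            co_await ss::sleep_abortable(std::chrono::seconds(1), as);
        } catch (const ss::sleep_aborted&) {
            break; // shutdown requested; exit promptly
        }
        // ... one unit of work per iteration ...
    }
}
```

Fibers blocked like this unwind promptly once abort is requested, which is why the call only matters if something is still waiting on the abort source.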
```cpp
@@ -304,13 +303,13 @@ class partition_record_batch_reader_impl final
    unknown_exception_ptr = std::current_exception();
}

// The reader may have been left in an indeterminate state.
// Re-set the pointer to it to ensure that it will not be reused.
// Regardless of which error, the reader may have been left in an
```
nit: this confused me for a moment, until I realized that we only fall through here in case of errors: maybe tweak the comment to mention that this path is only taken in case of exceptions from the above try{}
Done
When a node shutdown is taking a long time, it's helpful to understand which subsystems the shutdown is spending its time in. This commit adds some log statements to aid in debugging; they helped identify the source of a hang.
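A minimal sketch of the kind of breadcrumb logging described, with a hypothetical logger and stop hooks:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/util/log.hh>
namespace ss = seastar;

static ss::logger shutdown_log("shutdown"); // illustrative logger

ss::future<> stop_cloud_storage(); // hypothetical per-subsystem
ss::future<> stop_storage();       // stop hooks

ss::future<> stop_subsystems() {
    // Each breadcrumb bounds how long the preceding subsystem took,
    // so a hang shows up between two known log lines.
    shutdown_log.info("stopping cloud storage...");
    co_await stop_cloud_storage();
    shutdown_log.info("stopping storage...");
    co_await stop_storage();
    shutdown_log.info("shutdown complete");
}
```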
We previously only called set_end_of_stream() when met with an unexpected exception, citing that the stream is left in an undefined state. It seems reasonable to do this for gate_closed_exceptions as well, since we already do it explicitly from the loop when the gate is closed.
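A sketch of the shape this commit describes, with hypothetical names standing in for the reader internals:

```cpp
#include <exception>
#include <seastar/core/coroutine.hh>
#include <seastar/core/gate.hh>
namespace ss = seastar;

ss::future<> do_read();   // hypothetical read step
void set_end_of_stream(); // hypothetical: marks the stream finished

ss::future<> read_once(std::exception_ptr& unknown_exception_ptr) {
    try {
        co_await do_read();
    } catch (const ss::gate_closed_exception&) {
        // Shutting down mid-read: the stream may be in an undefined
        // state, so mark it finished just like the catch-all path.
        set_end_of_stream();
    } catch (...) {
        unknown_exception_ptr = std::current_exception();
        set_end_of_stream();
    }
}
```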
We previously printed "{}" rather than the target string.
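One common way this class of bug arises (the exact call site isn't quoted in this PR): doubled braces are an escape in fmt, so the placeholder prints literally:

```cpp
#include <fmt/core.h>
#include <string>

int main() {
    std::string target = "segment-0-1.log";  // illustrative value
    fmt::print("downloading {{}}\n");         // bug: prints "downloading {}"
    fmt::print("downloading {}\n", target);   // fix: prints the target string
}
```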
We could throw `boost::system::system_error (partial message)` when shutting down the HTTP client in the cloud storage hydration loop:

```
ERROR 2023-02-01 22:32:39,545 [shard 1] cloud_storage - [fiber12~0 kafka/scale_000000/1 [0:642]] - remote_segment.cc:708 - Error in hydraton loop: boost::system::system_error (partial message)
```

This is caused by the HTTP client being shut down, so this commit makes the client first check whether it is being shut down before throwing any other error.
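A sketch of the check described, assuming an abort source tracks client shutdown (names hypothetical):

```cpp
#include <boost/system/system_error.hpp>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
namespace ss = seastar;

ss::future<> receive_payload(); // hypothetical transport read

ss::future<> receive_checked(ss::abort_source& as) {
    try {
        co_await receive_payload();
    } catch (const boost::system::system_error&) {
        // If we initiated the shutdown, "partial message" is expected
        // noise: report the shutdown rather than the socket error.
        if (as.abort_requested()) {
            throw ss::abort_requested_exception();
        }
        throw;
    }
}
```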
Per Seastar guidance[1], we should return exceptional futures from asynchronous functions rather than throwing; the throw here resulted in an uncaught exception. [1] https://docs.seastar.io/master/split/7.html
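A minimal illustration of the cited guidance (the function and error are hypothetical):

```cpp
#include <stdexcept>
#include <seastar/core/future.hh>
namespace ss = seastar;

ss::future<> check_ready(bool ready) {
    if (!ready) {
        // A plain `throw` here would escape synchronously, outside the
        // caller's future chain, and could go uncaught; hand back an
        // exceptional future instead.
        return ss::make_exception_future<>(std::runtime_error("not ready"));
    }
    return ss::make_ready_future<>();
}
```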
Reproduces redpanda-data#8331 in debug mode.
Force-pushed from c7e8cd9 to bb6ef04.
I removed the rejiggering of the
Failures were #8662
/backport v22.3.x
Failed to run cherry-pick command. I executed the below command:
great set of changes!
This PR cleans up shutdown in a few places that, in sum, contributed to a shutdown hang when tiered storage is enabled. It also adds some logs and converts some methods to coroutines to facilitate logging.
Fixes #8331
Backports Required
UX Changes
Release Notes
Bug Fixes