
Cloud storage shutdown hang #8587

Merged
merged 7 commits into redpanda-data:dev on Feb 8, 2023

Conversation

@andrwng (Contributor) commented Feb 3, 2023

This PR cleans up shutdown in a few places that, taken together, contributed to a shutdown hang when tiered storage is enabled. It also adds some logs, and updates some methods to coroutines to facilitate logging.

Fixes #8331

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Bug Fixes

  • Fixed a hang at shutdown when using tiered storage.

@andrwng marked this pull request as ready for review February 4, 2023 00:25
```cpp
ss::maybe_yield().get();
ss::sleep(std::chrono::milliseconds(10)).get();
api.shutdown_connections();
g.close().get();
```
Contributor

When this fails, does it manifest as a hang of the test? If so, that was fine as a short-term reproducer, but for committing it we should wrap this in a spin_wait_with_timeout or similar, so that the test fails cleanly.
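
For reference, a minimal sketch of that kind of bounded wait using plain Seastar primitives; the 30s deadline, the `condition_reached()` predicate, and the Boost.Test assertion are illustrative assumptions, not the actual test code:

```cpp
// Illustrative sketch only: poll a condition with a deadline so a regression
// shows up as a clean test failure instead of a hung test run.
auto deadline = ss::lowres_clock::now() + std::chrono::seconds(30);
while (!condition_reached()) { // placeholder predicate for the test's condition
    if (ss::lowres_clock::now() > deadline) {
        BOOST_FAIL("timed out waiting for shutdown condition");
    }
    ss::sleep(std::chrono::milliseconds(10)).get();
}
```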

Contributor Author

Done

@jcsp (Contributor) commented Feb 6, 2023

The other fanout fix PR has merged, so this can be rebased and the commit touching that file dropped.

```diff
@@ -217,22 +217,22 @@ static ss::future<result<model::record_batch_header>> read_header_impl(

 ss::future<result<model::record_batch_header>>
 continuous_batch_parser::read_header() {
-    return read_header_impl(get_stream(), *_consumer, _recovery);
+    auto& st = get_stream();
```
Contributor

nit: this one was arguably easier to read before. Is this perhaps a place that you coroutinized to add some debug logging that's gone now?

Contributor Author

Yep, this is leftover from some debugging. I'll actually move the coroutinizing into a separate PR, to keep this one focused on actual fixes.

```diff
@@ -100,6 +102,7 @@ ss::future<> remote::stop() {
 void remote::shutdown_connections() {
     cst_log.debug("Shutting down remote connections...");
     _pool.shutdown_connections();
+    _as.request_abort();
```
Contributor

I think this is fine, but it shouldn't be functionally necessary, right? If it were, then I would think something was wrong elsewhere.

If this is just to help with logging, then it's fine, but maybe add a comment to explain that.
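
For context on what `request_abort()` buys here, a small standalone Seastar sketch (the helper name and the timeout are illustrative, not from the PR): requesting an abort makes abortable waits tied to that source complete exceptionally, which is how in-flight sleeps and retries get cut short at shutdown.

```cpp
#include <chrono>
#include <seastar/core/abort_source.hh>
#include <seastar/core/sleep.hh>

namespace ss = seastar;

// Hypothetical helper: waits up to a minute, but returns promptly once
// as.request_abort() has been called elsewhere (e.g. during shutdown).
ss::future<> wait_or_abort(ss::abort_source& as) {
    return ss::sleep_abortable(std::chrono::seconds(60), as)
      .handle_exception_type([](const ss::sleep_aborted&) {
          // Abort requested: unwind quietly instead of waiting out the sleep.
      });
}
```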

Contributor Author

Yeah, this was one of the first things I was suspicious of, but it ended up not being necessary. I'll remove it from the PR.

```diff
@@ -304,13 +303,13 @@ class partition_record_batch_reader_impl final
         unknown_exception_ptr = std::current_exception();
     }

-    // The reader may have been left in an indeterminate state.
-    // Re-set the pointer to it to ensure that it will not be reused.
+    // Regardless of which error, the reader may have been left in an
```
Contributor

nit: this confused me for a moment, until I realized that we only fall through here in case of errors; maybe tweak the comment to mention that this path is only taken for exceptions thrown from the try{} above.

Contributor Author

Done

When a node shutdown is taking a long time, it's helpful to understand which subsystems the shutdown is spending its time in.

This commit adds some log statements to aid that debugging; they helped identify the source of a hang.

We previously only called set_end_of_stream() when met with an unexpected exception, on the grounds that the stream was left in an undefined state. It seems reasonable to do the same for gate_closed_exception, since we already do this explicitly from the loop when the gate is closed.
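
Roughly, the shape of that handling looks like the following sketch; the read step is a placeholder and this is not the PR's exact code:

```cpp
try {
    co_await read_next(); // placeholder for the reader's read step
} catch (const ss::gate_closed_exception&) {
    // Shutting down mid-read: the stream may be half-consumed, so mark it
    // finished just as we do for unexpected exceptions.
    set_end_of_stream();
} catch (...) {
    // Unknown failure: the stream is left in an undefined state.
    set_end_of_stream();
    unknown_exception_ptr = std::current_exception();
}
```
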
We previously printed "{}" rather than the target string.

We could throw `boost::system::system_error (partial message)` when
shutting down the HTTP client in the cloud storage hydration loop.

```
ERROR 2023-02-01 22:32:39,545 [shard 1] cloud_storage - [fiber12~0 kafka/scale_000000/1 [0:642]] - remote_segment.cc:708 - Error in hydraton loop: boost::system::system_error (partial message)
```

This is caused by the HTTP client being shut down, so this commit makes the client first check whether it is being shut down before throwing any other error.
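
The gist, sketched with hypothetical names rather than the client's real error path: when shutdown has been requested, report that instead of whatever transport error the teardown produced.

```cpp
#include <exception>
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>

namespace ss = seastar;

// Hypothetical sketch: prefer reporting shutdown over the low-level
// boost::system::system_error ("partial message") raised while tearing
// down the connection.
ss::future<> handle_transport_error(std::exception_ptr eptr, ss::abort_source& as) {
    if (as.abort_requested()) {
        return ss::make_exception_future<>(ss::abort_requested_exception{});
    }
    return ss::make_exception_future<>(std::move(eptr));
}
```
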
Per Seastar guidance[1], we should return exceptional futures from asynchronous functions rather than throwing. Throwing here resulted in an uncaught exception.

[1] https://docs.seastar.io/master/split/7.html
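
A minimal standalone illustration of that guidance (not code from the PR): throwing from a plain future-returning function escapes the futures chain, while returning an exceptional future keeps the error where callers can handle it.

```cpp
#include <seastar/core/future.hh>
#include <stdexcept>

namespace ss = seastar;

// Throwing here propagates synchronously and can surface as an uncaught
// exception in a caller that only handles failed futures.
ss::future<int> parse_throwing(int x) {
    if (x < 0) {
        throw std::invalid_argument("negative");
    }
    return ss::make_ready_future<int>(x);
}

// Returning an exceptional future keeps the error on the futures path, so
// callers can deal with it via handle_exception()/then_wrapped().
ss::future<int> parse_exceptional(int x) {
    if (x < 0) {
        return ss::make_exception_future<int>(std::invalid_argument("negative"));
    }
    return ss::make_ready_future<int>(x);
}
```
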
@andrwng (Contributor Author) commented Feb 6, 2023

I removed the rejiggering of the remote abort_source and moved the coroutinizing commit into a separate PR (#8653), in hopes that'll make this one a bit more backportable.

@jcsp jcsp merged commit 5a5b310 into redpanda-data:dev Feb 8, 2023
@jcsp (Contributor) commented Feb 8, 2023

Failures were #8662

@vshtokman (Contributor)

/backport v22.3.x

@vbotbuildovich (Collaborator)

Failed to run cherry-pick command. I executed the below command:

```
git cherry-pick -x f72a7ed8f6e0ed825542b59e93f8cdca3a80a2fb b9cf31f7276a6f591b952b37d0315c9bf4f5ab3a d7e3f19bbbdc0afab3a02ebe9daff3494fad9bc6 9488d4ad2a7d17cfc32450af59db4b7ed5605f27 fe657df8ff084a83d3b1211df589191738388842 4c133a0801d11223463cffb44ed92a6aeecd8d99 bb6ef0409ad0f908cd9a0ca987343975e01dab96
```

Workflow run logs.

@dotnwat (Member) left a comment

great set of changes!

Successfully merging this pull request may close these issues.

Redpanda with tiered storage doesn't stop for 10 minutes after being signaled