
Cloud storage shutdown hang #8587

Merged
merged 7 commits into redpanda-data:dev on Feb 8, 2023

Conversation

@andrwng (Contributor) commented Feb 3, 2023

This PR cleans up shutdown in a few places that, taken together, contributed to a shutdown hang when tiered storage is enabled. It also adds some logs, and updates some methods to coroutines to facilitate logging.

Fixes #8331

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Bug Fixes

  • Fixed a hang at shutdown when using tiered storage.

@andrwng marked this pull request as ready for review February 4, 2023 00:25
```cpp
ss::maybe_yield().get();
ss::sleep(std::chrono::milliseconds(10)).get();
api.shutdown_connections();
g.close().get();
```
Contributor

When this fails, does it manifest as a hang of the test? If so, that was fine as a short-term reproducer, but for committing it we should wrap this in a spin_wait_with_timeout or similar, so that the test fails cleanly.
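
For reference, a minimal sketch of that kind of bounded wait using plain Seastar primitives; the 30s deadline, the `condition_reached()` predicate, and the Boost.Test assertion are illustrative assumptions, not the actual test code:

```cpp
// Illustrative sketch only: poll a condition with a deadline so a regression
// shows up as a clean test failure instead of a hung test run.
auto deadline = ss::lowres_clock::now() + std::chrono::seconds(30);
while (!condition_reached()) { // placeholder predicate for the test's condition
    if (ss::lowres_clock::now() > deadline) {
        BOOST_FAIL("timed out waiting for shutdown condition");
    }
    ss::sleep(std::chrono::milliseconds(10)).get();
}
```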

Contributor Author

Done

@jcsp (Contributor) commented Feb 6, 2023

The other fanout fix PR has merged, so this can be rebased and the commit touching that file dropped.

```diff
@@ -217,22 +217,22 @@ static ss::future<result<model::record_batch_header>> read_header_impl(

 ss::future<result<model::record_batch_header>>
 continuous_batch_parser::read_header() {
-    return read_header_impl(get_stream(), *_consumer, _recovery);
+    auto& st = get_stream();
```
Contributor

nit: this one was arguably easier to read before. Is this perhaps a place that you coroutinized to add some debug logging that's gone now?

Contributor Author

Yep, this is leftover from some debugging. I'll actually move the coroutinizing into a separate PR, to keep this one focused on actual fixes.

```diff
@@ -100,6 +102,7 @@ ss::future<> remote::stop() {
 void remote::shutdown_connections() {
     cst_log.debug("Shutting down remote connections...");
     _pool.shutdown_connections();
+    _as.request_abort();
```
Contributor

I think this is fine, but it shouldn't be functionally necessary, right? If it were, then I would think something was wrong elsewhere.

If this is just to help with logging, then it's fine, but maybe add a comment to explain that.
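
For context on what `request_abort()` buys here, a small standalone Seastar sketch (the helper name and the timeout are illustrative, not from the PR): requesting an abort makes abortable waits tied to that source complete exceptionally, which is how in-flight sleeps and retries get cut short at shutdown.

```cpp
#include <chrono>
#include <seastar/core/abort_source.hh>
#include <seastar/core/sleep.hh>

namespace ss = seastar;

// Hypothetical helper: waits up to a minute, but returns promptly once
// as.request_abort() has been called elsewhere (e.g. during shutdown).
ss::future<> wait_or_abort(ss::abort_source& as) {
    return ss::sleep_abortable(std::chrono::seconds(60), as)
      .handle_exception_type([](const ss::sleep_aborted&) {
          // Abort requested: unwind quietly instead of waiting out the sleep.
      });
}
```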

Contributor Author

Yeah, this was one of the first things I was suspicious of, but it ended up not being necessary. I'll remove it from the PR.

```diff
@@ -304,13 +303,13 @@ class partition_record_batch_reader_impl final
         unknown_exception_ptr = std::current_exception();
     }

-    // The reader may have been left in an indeterminate state.
-    // Re-set the pointer to it to ensure that it will not be reused.
+    // Regardless of which error, the reader may have been left in an
```
Contributor

nit: this confused me for a moment, until I realized that we only fall through here in case of errors; maybe tweak the comment to mention that this path is only taken for exceptions thrown from the try{} above.

Contributor Author

Done

When a node shutdown is taking a long time, it's helpful to understand which subsystems the shutdown is spending its time in.

This commit adds some log statements to aid that debugging; they helped identify the source of a hang.

We previously only called set_end_of_stream() when met with an unexpected exception, on the grounds that the stream was left in an undefined state. It seems reasonable to do the same for gate_closed_exception, since we already do this explicitly from the loop when the gate is closed.
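
Roughly, the shape of that handling looks like the following sketch; the read step is a placeholder and this is not the PR's exact code:

```cpp
try {
    co_await read_next(); // placeholder for the reader's read step
} catch (const ss::gate_closed_exception&) {
    // Shutting down mid-read: the stream may be half-consumed, so mark it
    // finished just as we do for unexpected exceptions.
    set_end_of_stream();
} catch (...) {
    // Unknown failure: the stream is left in an undefined state.
    set_end_of_stream();
    unknown_exception_ptr = std::current_exception();
}
```
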
We previously printed "{}" rather than the target string.

We could throw `boost::system::system_error (partial message)` when
shutting down the HTTP client in the cloud storage hydration loop.

```
ERROR 2023-02-01 22:32:39,545 [shard 1] cloud_storage - [fiber12~0 kafka/scale_000000/1 [0:642]] - remote_segment.cc:708 - Error in hydraton loop: boost::system::system_error (partial message)
```

This is caused by the HTTP client being shut down, so this commit makes the client first check whether it is being shut down before throwing any other error.
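
The gist, sketched with hypothetical names rather than the client's real error path: when shutdown has been requested, report that instead of whatever transport error the teardown produced.

```cpp
#include <exception>
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>

namespace ss = seastar;

// Hypothetical sketch: prefer reporting shutdown over the low-level
// boost::system::system_error ("partial message") raised while tearing
// down the connection.
ss::future<> handle_transport_error(std::exception_ptr eptr, ss::abort_source& as) {
    if (as.abort_requested()) {
        return ss::make_exception_future<>(ss::abort_requested_exception{});
    }
    return ss::make_exception_future<>(std::move(eptr));
}
```
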
Per Seastar guidance[1], we should return exceptional futures from asynchronous functions rather than throwing. Throwing here resulted in an uncaught exception.

[1] https://docs.seastar.io/master/split/7.html
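
A minimal standalone illustration of that guidance (not code from the PR): throwing from a plain future-returning function escapes the futures chain, while returning an exceptional future keeps the error where callers can handle it.

```cpp
#include <seastar/core/future.hh>
#include <stdexcept>

namespace ss = seastar;

// Throwing here propagates synchronously and can surface as an uncaught
// exception in a caller that only handles failed futures.
ss::future<int> parse_throwing(int x) {
    if (x < 0) {
        throw std::invalid_argument("negative");
    }
    return ss::make_ready_future<int>(x);
}

// Returning an exceptional future keeps the error on the futures path, so
// callers can deal with it via handle_exception()/then_wrapped().
ss::future<int> parse_exceptional(int x) {
    if (x < 0) {
        return ss::make_exception_future<int>(std::invalid_argument("negative"));
    }
    return ss::make_ready_future<int>(x);
}
```
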
@andrwng (Contributor Author) commented Feb 6, 2023

I removed the rejiggering of the remote abort_source and moved the coroutinizing commit into a separate PR (#8653), in hopes that'll make this one a bit more backportable.

@jcsp jcsp merged commit 5a5b310 into redpanda-data:dev Feb 8, 2023
@jcsp (Contributor) commented Feb 8, 2023

Failures were #8662

@vshtokman (Contributor)

/backport v22.3.x

@vbotbuildovich (Collaborator)

Failed to run cherry-pick command. I executed the below command:

```
git cherry-pick -x f72a7ed8f6e0ed825542b59e93f8cdca3a80a2fb b9cf31f7276a6f591b952b37d0315c9bf4f5ab3a d7e3f19bbbdc0afab3a02ebe9daff3494fad9bc6 9488d4ad2a7d17cfc32450af59db4b7ed5605f27 fe657df8ff084a83d3b1211df589191738388842 4c133a0801d11223463cffb44ed92a6aeecd8d99 bb6ef0409ad0f908cd9a0ca987343975e01dab96
```

Workflow run logs.

@dotnwat (Member) left a comment

great set of changes!

Successfully merging this pull request may close these issues.

Redpanda with tiered storage doesn't stop for 10 minutes after being signaled