cluster: fix shutting down while controller log replay is in progress #5707

jcsp · 2022-07-28T14:48:18Z

Cover letter

Previously, while we waited for the controller applied offset to reach last_applied, we would ignore SIGINT signals.

This is noticeable if you have a very long controller log.

Fixed it by giving controller::start a reference to the ::stop_signal abort source from application.cc (more explanation as to why in the commit messages).

UX changes

None

Release notes

Improvements

Redpanda now shuts down more quickly if it is signalled while still starting up.

jcsp · 2022-07-28T14:48:57Z

@mmaslankaprv wdyt? This needs unit tests fixing up, but I wanted to ask your opinion on the approach before spending time on that.

dotnwat · 2022-07-28T21:53:15Z

src/v/redpanda/application.cc

-    controller->start().get0();
+    controller->start(app_signal.abort_source()).get0();
+
+    _deferred.emplace_back([this] { controller->shutdown_input().get(); });


was reordering of the input shutdown to occur after group migration intentional?

Yes, see commit message -- this looked to me like it was an accident in c967452 where the group migration stuff was added between controller start and registering the controller shutdown hook.

oh shoot, my bad.

yes, you are right about https://github.com/redpanda-data/redpanda/blame/dev/src/v/redpanda/application.cc#L1268-L1273. I think that comment can be lifted up above the new location of the shutdown; that commit you referenced seems to have hijacked it.

you're right, I've moved the line back to the correct side of the comment

mmaslankaprv · 2022-07-29T08:43:08Z

This looks good, although I wonder whether we could shut down controller input directly from the app-wide abort source. That way we would trigger the controller abort source and would not need to pass the abort source reference to the start() method. It also wouldn't require us to keep track of the abort source shard.
The subscription would have to be registered right after controller::wire_up returns.

jcsp · 2022-11-24T23:08:46Z

> if we could shutdown controller input directly from the app wide abort source, this way we would trigger controller abort source and not need to pass in the abort source reference to start() method.

That would be nicer... I played around with this a bit, but I can't subscribe to the abort source to call shutdown_input, because shutdown_input is async (it dispatches to all cores). So to make it work I'd have to add a specialized hook on controller for shutting down just shard0's abort source, which is about as special-casey as passing the shard0_as into start().
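To illustrate the constraint described above, here is a minimal sketch of a subscription-style abort source. It assumes callbacks return void and run synchronously inside `request_abort()`, so a callback can flip another shard-local abort source but has nothing to wait on for a cross-core dispatch. `simple_abort_source` and `forward_abort` are hypothetical names made up for this sketch; they are not Seastar's or Redpanda's classes, and running a late subscriber immediately is a choice of the sketch only.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified abort source for illustration only.
class simple_abort_source {
public:
    // Callbacks return void: the caller cannot wait on asynchronous work.
    using callback = std::function<void()>;

    void subscribe(callback cb) {
        assert(cb && "callback must be callable");
        if (_aborted) {
            // Sketch choice: run late subscribers immediately.
            cb();
            return;
        }
        _subscribers.push_back(std::move(cb));
    }

    // Marks the source as aborted and runs every subscriber synchronously.
    void request_abort() {
        if (_aborted) {
            return;
        }
        _aborted = true;
        for (auto& cb : _subscribers) {
            cb();
        }
        _subscribers.clear();
    }

    bool abort_requested() const { return _aborted; }

private:
    bool _aborted = false;
    std::vector<callback> _subscribers;
};

// Forward an abort from `parent` to `child`. `child` must stay alive until
// `parent` has been aborted or destroyed.
inline void forward_abort(simple_abort_source& parent, simple_abort_source& child) {
    parent.subscribe([&child] { child.request_abort(); });
}
```

In this model, forwarding the app-wide source to a single shard-local source is easy, which is the "specialized hook" option; calling something that has to fan out to every core and be awaited does not fit in the callback.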

dotnwat · 2022-11-29T06:12:47Z

We've got a merge conflict here

The underlying wait already takes one, this is just a pass-through.

This fixes shutting down redpanda while it is busy replaying the controller log. The controller has its own abort source, but it does not get registered as a defer() hook (via shutdown_input) until after start() has returned. To enable long running parts of start() to be interruptible, we need an external abort source. Conveniently application already has one: it is only on shard 0, but that's okay because we only need it on controller_stm_shard (0).
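The idea of making a long-running start step interruptible through an externally owned abort source can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the code from this PR: `abort_flag`, `replay_until`, and `apply_one` are hypothetical names, and the real change passes an abort source into an asynchronous wait rather than polling a flag in a loop.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for an application-owned abort source.
class abort_flag {
public:
    void request_abort() { _aborted = true; }
    bool abort_requested() const { return _aborted; }

private:
    bool _aborted = false;
};

enum class replay_result { caught_up, aborted };

// Apply offsets 0..last_applied in order, checking the external abort flag
// before each entry so that a shutdown request interrupts the replay.
// `apply_one` is a hypothetical callback that applies a single entry.
inline replay_result replay_until(
  std::int64_t last_applied,
  const abort_flag& as,
  const std::function<void(std::int64_t)>& apply_one) {
    assert(last_applied >= -1); // -1 means "nothing to replay"
    for (std::int64_t offset = 0; offset <= last_applied; ++offset) {
        if (as.abort_requested()) {
            return replay_result::aborted;
        }
        apply_one(offset);
    }
    return replay_result::caught_up;
}
```

Because the flag is owned by the caller, it can be set before any shutdown hook of the component being started has been registered, which is the situation the commit message describes.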

jcsp · 2022-11-29T09:05:38Z

Rebased (conflict was with another of my PRs)

dotnwat · 2022-11-29T23:55:31Z

Failure is #7397

jcsp added the area/controller and kind/bug labels Jul 28, 2022

github-actions bot added the area/redpanda label Jul 28, 2022

dotnwat reviewed Jul 28, 2022

dotnwat previously approved these changes Jul 28, 2022

jcsp dismissed dotnwat’s stale review via 2be2015 July 29, 2022 08:33

jcsp force-pushed the cluster-startup-shutdown-hang branch from b097ed2 to 2be2015 Compare July 29, 2022 08:33

jcsp requested a review from dotnwat July 29, 2022 08:34

jcsp force-pushed the cluster-startup-shutdown-hang branch from 2be2015 to 9b71bc4 Compare November 24, 2022 23:03

jcsp marked this pull request as ready for review November 24, 2022 23:08

jcsp force-pushed the cluster-startup-shutdown-hang branch from 9b71bc4 to a60184d Compare November 25, 2022 15:38

jcsp added 2 commits November 29, 2022 08:59

raft: let state_machine::wait take an abort source

7be3f80

The underlying wait already takes one, this is just a pass-through.

jcsp force-pushed the cluster-startup-shutdown-hang branch from a60184d to c024fbc Compare November 29, 2022 09:05

dotnwat approved these changes Nov 29, 2022

dotnwat merged commit cecf67e into redpanda-data:dev Nov 29, 2022
