Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cluster: fix shutting down while controller log replay is in progress #5707

Merged
merged 2 commits into from
Nov 29, 2022

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Jul 28, 2022

Cover letter

Previously, while we waited for the controller applied offset to reach last_applied, we would ignore SIGINT signals.

This is noticeable if you have a very long controller log.

Fixed it by giving controller::start a reference to the ::stop_signal abort source from application.cc (more explanation as to why in the commit messages).

UX changes

None

Release notes

Improvements

  • Redpanda now shuts down more quickly if it is signalled while still starting up.

@jcsp jcsp added area/controller kind/bug Something isn't working labels Jul 28, 2022
@jcsp
Copy link
Contributor Author

jcsp commented Jul 28, 2022

@mmaslankaprv wdyt? This needs unit tests fixing up, but I wanted to ask your opinion on the approach before spending time on that.

controller->start().get0();
controller->start(app_signal.abort_source()).get0();

_deferred.emplace_back([this] { controller->shutdown_input().get(); });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was reordering of the input shutdown to occur after group migration intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see commit message -- this looked to me like it was an accident in c967452 where the group migration stuff was added between controller start and registering the controller shutdown hook.

Copy link
Member

@dotnwat dotnwat Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh shoot, my bad.

yes, you are right about https://github.com/redpanda-data/redpanda/blame/dev/src/v/redpanda/application.cc#L1268-L1273 i think that comment can be lifted up above the new location of the shutdown. that commit you referenced seems to have hijacked it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, I've moved the line back to the correct side of the comment

dotnwat
dotnwat previously approved these changes Jul 28, 2022
@mmaslankaprv
Copy link
Member

this looks good, although i am thinking if we could shutdown controller input directly from the app wide abort source, this way we would trigger controller abort source and not need to pass in the abort source reference to start() method. This wouldn't require us of keeping track of abort source shard.
Subscription would have to be registered right after the controller::wire_up returns

@jcsp jcsp force-pushed the cluster-startup-shutdown-hang branch from 2be2015 to 9b71bc4 Compare November 24, 2022 23:03
@jcsp
Copy link
Contributor Author

jcsp commented Nov 24, 2022

if we could shutdown controller input directly from the app wide abort source, this way we would trigger controller abort source and not need to pass in the abort source reference to start() method.>

That would be nicer... I played around with this a bit, can't subscribe to the abort source to call shutdown input, because shutdown_input is async (it dispatches to all cores). So to make it work I'd have to do a specialized hook on controller for shutting down just shard0's abort source, which is about as special-casey as passing the shard0_as into start().

@jcsp jcsp marked this pull request as ready for review November 24, 2022 23:08
@jcsp jcsp force-pushed the cluster-startup-shutdown-hang branch from 9b71bc4 to a60184d Compare November 25, 2022 15:38
@dotnwat
Copy link
Member

dotnwat commented Nov 29, 2022

We've got a merge conflict here

The underlying wait already takes one, this is just
a pass-through.
This fixes shutting down redpanda while it is busy
replaying controller log.

The controller has its own abort source, but the
it does not get registered as a defer() hook (via shutdown_input)
until after start() has returned.

To enable long running parts of start() to be interruptible,
we need an external abort source.  Conveniently application
already has one: it is only on shard 0, but that's okay because
we only need it on controller_stm_shard (0).
@jcsp jcsp force-pushed the cluster-startup-shutdown-hang branch from a60184d to c024fbc Compare November 29, 2022 09:05
@jcsp
Copy link
Contributor Author

jcsp commented Nov 29, 2022

Rebased (conflict was with another of my PRs)

@dotnwat
Copy link
Member

dotnwat commented Nov 29, 2022

Failure is #7397

@dotnwat dotnwat merged commit cecf67e into redpanda-data:dev Nov 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants