cluster: fix shutting down while controller log replay is in progress #5707
@mmaslankaprv wdyt? This needs unit tests fixing up, but I wanted to ask your opinion on the approach before spending time on that.
src/v/redpanda/application.cc
controller->start().get0();
controller->start(app_signal.abort_source()).get0();
_deferred.emplace_back([this] { controller->shutdown_input().get(); });
was reordering of the input shutdown to occur after group migration intentional?
Yes, see commit message -- this looked to me like it was an accident in c967452 where the group migration stuff was added between controller start and registering the controller shutdown hook.
oh shoot, my bad.
yes, you are right about https://github.com/redpanda-data/redpanda/blame/dev/src/v/redpanda/application.cc#L1268-L1273 i think that comment can be lifted up above the new location of the shutdown. that commit you referenced seems to have hijacked it.
you're right, I've moved the line back to the correct side of the comment
b097ed2 to 2be2015
this looks good, although i am thinking if we could shutdown controller input directly from the app wide abort source, this way we would trigger
2be2015 to 9b71bc4
That would be nicer... I played around with this a bit, can't subscribe to the abort source to call shutdown input, because shutdown_input is async (it dispatches to all cores). So to make it work I'd have to do a specialized hook on controller for shutting down just shard0's abort source, which is about as special-casey as passing the shard0_as into start().
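The trade-off in the comment above can be illustrated with a small standard-library sketch. It assumes that an abort-source subscription is a plain synchronous callback, so it can flip one shard's abort flag inline but cannot wait on work dispatched to every core. `sync_abort_source`, `shard_state`, `controller_sketch`, and `abort_shard0` are hypothetical names made up for this illustration; none of this is Seastar's or Redpanda's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified abort source. Subscribers are synchronous
// callbacks that run inline inside request_abort().
class sync_abort_source {
public:
    void subscribe(std::function<void()> cb) { _subs.push_back(std::move(cb)); }

    void request_abort() {
        if (_requested) {
            return; // only fire subscribers once
        }
        _requested = true;
        for (auto& cb : _subs) {
            cb();
        }
    }

    bool abort_requested() const { return _requested; }

private:
    bool _requested = false;
    std::vector<std::function<void()>> _subs;
};

// Hypothetical per-shard state: each "shard" owns its own abort source.
struct shard_state {
    sync_abort_source as;
};

// Hypothetical controller with the special-case hook discussed above.
struct controller_sketch {
    std::vector<shard_state> shards;

    explicit controller_sketch(std::size_t n) : shards(n) {}

    // Synchronous, so it is safe to call from a subscription callback.
    // It only touches shard 0; the other shards are left to the normal
    // (asynchronous) shutdown path, which is not modelled here.
    void abort_shard0() { shards[0].as.request_abort(); }
};
```

Wiring the application-wide source to the hook would then be a single `subscribe` call with a lambda that invokes `abort_shard0()`. As the comment notes, that hook is roughly as special-cased as passing the shard 0 abort source into `start()` directly.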
9b71bc4 to a60184d
We've got a merge conflict here
The underlying wait already takes one, this is just a pass-through.
This fixes shutting down redpanda while it is busy replaying the controller log. The controller has its own abort source, but it does not get registered as a defer() hook (via shutdown_input) until after start() has returned. To enable long-running parts of start() to be interruptible, we need an external abort source. Conveniently, application already has one: it is only on shard 0, but that's okay because we only need it on controller_stm_shard (0).
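A minimal sketch of the idea in this commit message, using only the C++ standard library: a long-running replay loop takes an externally owned abort source and checks it between steps, so a shutdown request can interrupt it before it returns. `abort_source_sketch`, `abort_requested_error`, and `replay_until` are hypothetical names for illustration, and the `on_apply` callback exists only to simulate a signal arriving mid-replay. The real change passes the application's abort source through to an existing wait rather than polling in a loop like this.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for an abort source owned by the application.
struct abort_source_sketch {
    std::atomic<bool> requested{false};

    void request_abort() { requested.store(true); }
    bool abort_requested() const { return requested.load(); }
};

// Thrown when the replay is interrupted by an abort request.
struct abort_requested_error : std::runtime_error {
    abort_requested_error() : std::runtime_error("abort requested") {}
};

// Hypothetical replay loop: applies entries one at a time until the
// applied offset reaches last_applied, checking the external abort
// source before each step. Returns the final applied offset.
template <typename OnApply>
std::int64_t replay_until(
  std::int64_t last_applied, abort_source_sketch& as, OnApply on_apply) {
    assert(last_applied >= 0);
    std::int64_t applied = 0;
    while (applied < last_applied) {
        if (as.abort_requested()) {
            throw abort_requested_error{};
        }
        ++applied;
        on_apply(applied); // test hook, e.g. to simulate SIGINT
    }
    return applied;
}
```

Without the abort source parameter, the loop would have no way to observe a shutdown request until it had finished, which matches the behaviour the PR sets out to fix.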
a60184d to c024fbc
Rebased (conflict was with another of my PRs)
Failure is #7397
Cover letter
Previously, while we waited for the controller applied offset to reach last_applied, we would ignore SIGINT signals.
This is noticeable if you have a very long controller log.
Fixed it by giving controller::start a reference to the ::stop_signal abort source from application.cc (more explanation as to why in the commit messages).
UX changes
None
Release notes
Improvements