Try to start the queue maintainer multiple times with backoff #1184
Conversation
fe16c1d to a4e03c6 (Compare)
```go
	initialPeriodicJobs []*riverpilot.PeriodicJob
	subServices         []startstop.Service
)
if err := func() error {
```
Ended up indenting all of this so that there's a single `if err != nil` check where we can call stopped() at the end. Previously, stopped() wasn't being invoked in some error branches, which could lead to this service erroring on start and never really stopping :/
```go
for _, service := range m.servicesByName {
	if err := service.Start(ctx); err != nil {
		startstop.StopAllParallel(maputil.Values(m.servicesByName)...)
		stopped()
```
Similarly, make sure stopped() is invoked in the error condition. I really need to go look at this startstop API again to see if we can make this safer.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a4e03c6506
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
c.testSignals.queueMaintainerStartRetriesExhausted.Signal(struct{}{})
```
```go
if err := c.clientNotifyBundle.RequestResign(ctx); err != nil {
```
Abort stale resign requests after leadership loss
tryStartQueueMaintainer runs inside the same goroutine that consumes elector.Listen() notifications, so demotion events are queued until retries/backoff finish. If leadership is lost during those retries, this unconditional RequestResign call can run after another client has already become leader and force that healthy leader to resign, creating unnecessary leadership churn and maintenance interruptions. Please gate this path on current leadership (or cancel retries when a non-leader notification arrives) before sending resign notifications.
Useful? React with 👍 / 👎.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0bd52005af
client.go (Outdated)
```go
cancelQueueMaintainerStart()
cancelQueueMaintainerStart = func() {}
```
Stop queue maintainer before canceling start context
Canceling cancelQueueMaintainerStart() before c.queueMaintainer.Stop() changes the cancellation cause seen by maintenance subservices from startstop.ErrStop to generic context.Canceled. That breaks the reindexer’s stop-specific cleanup path (internal/maintenance/reindexer.go, errors.Is(context.Cause(ctx), startstop.ErrStop)), so if leadership is lost during REINDEX CONCURRENTLY, _ccnew/_ccold artifacts may be left behind and subsequent reindex runs will keep skipping until those artifacts are manually removed.
0bd5200 to 1bb76c7 (Compare)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1bb76c76e6
```go
// Stop the queue maintainer to fully reset its state (and any
// sub-services) before retrying.
c.queueMaintainer.Stop()
```
Skip stale retry before stopping queue maintainer
tryStartQueueMaintainer stops the shared maintainer on every start error, even when this retry context has already been canceled due to a leadership loss. If leadership flips (lose then regain) while an earlier Start call is still blocked (for example in a hook that does not return promptly on cancellation), that older goroutine can return later and execute Stop against the newer leader term’s successfully started maintainer, taking maintenance back offline unexpectedly. Add a cancellation/epoch check immediately after Start returns and before calling Stop so stale retries cannot affect a newer leadership term.
1bb76c7 to f855a82 (Compare)
This one's aimed at addressing #1161. `HookPeriodicJobsStart.Start` may return an error that causes the queue maintainer not to start, and there are a few other intermittent errors that may cause it not to start (say in the case of a transient DB problem). If this were to occur, the current course of action is for the client to just spit an error to logs and not try any additional remediation, which could have the effect of leaving the queue maintainer offline for extended periods.

Here, try to address this broadly by allowing the queue maintainer a few attempts at starting, with our standard exponential backoff (1s, 2s, 4s, 8s, etc.). In case the queue maintainer fails to start completely, the client requests resignation and hands leadership off to another client to see if it can start successfully.

I think this is an okay compromise because in case of a non-transient fundamental error (say `HookPeriodicJobsStart.Start` always returns an error), we don't go into a hot loop that starts hammering things. Instead, we get a reasonably responsible slow backoff that gives things a chance to recover, and which should be very visible in logs.

Fixes #1161.
f855a82 to c29f4c1 (Compare)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f855a82aed
```go
// Stop the queue maintainer to fully reset its state (and any
// sub-services) before retrying.
c.queueMaintainer.Stop()
```
Prevent stale start retries from stopping new leadership term
This Stop call can run from an older tryStartQueueMaintainer goroutine after leadership has changed, because cancellation is only checked just before this line and not synchronized with the stop itself. If leadership is lost between that check and this call, then quickly regained, the stale goroutine can stop a queue maintainer that was started by the newer term, leaving maintenance offline while this client is still leader. Add a term/epoch guard or a cancellation check that is atomic with the stop decision to avoid cross-term teardown.
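One way to make the check atomic with the stop decision, sketched with a hypothetical epoch counter guarded by a mutex (not River's actual fields): the client bumps the epoch on every leadership change, and a retry goroutine may only stop the maintainer while holding the lock and observing the epoch it started under.

```go
package main

import (
	"fmt"
	"sync"
)

// guard holds a leadership epoch; the epoch check and the stop happen
// under one lock, so a stale retry from an older term can never tear
// down a maintainer started by a newer term.
type guard struct {
	mu    sync.Mutex
	epoch int
}

// bumpEpoch records a leadership change (loss or gain).
func (g *guard) bumpEpoch() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.epoch++
}

// currentEpoch is read by a retry goroutine when it begins.
func (g *guard) currentEpoch() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.epoch
}

// stopIfSameEpoch runs stop only if no leadership change happened since
// startEpoch was observed; it reports whether the stop actually ran.
func (g *guard) stopIfSameEpoch(startEpoch int, stop func()) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.epoch != startEpoch {
		return false // stale retry from an older term: leave the maintainer alone
	}
	stop()
	return true
}

func main() {
	g := &guard{}
	start := g.currentEpoch()
	g.bumpEpoch() // leadership lost and regained elsewhere
	fmt.Println(g.stopIfSameEpoch(start, func() {})) // false
}
```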