
Try to start the queue maintainer multiple times with backoff #1184

Open
brandur wants to merge 2 commits into master from brandur-queue-maintainer-start

Conversation


@brandur brandur commented Mar 25, 2026

This one's aimed at addressing #1161. HookPeriodicJobsStart.Start may
return an error that causes the queue maintainer not to start, and there
are a few other intermittent errors that may cause it not to start (say
in the case of a transient DB problem). If this were to occur, the
course of action currently is for the client to just spit an error to
logs and not try any additional remediation, which could have the effect
of leaving the queue maintainer offline for extended periods.

Here, try to address this broadly by allowing the queue maintainer a few
attempts at starting, and with our standard exponential backoff (1s, 2s,
4s, 8s, etc.). In case a queue maintainer fails to start completely, the
client requests resignation and hands leadership off to another client
to see if it can start successfully.

I think this is an okay compromise because in case of a non-transient
fundamental error (say HookPeriodicJobsStart.Start always returns an
error), we don't go into a hot loop that starts hammering things.
Instead, we'll get a reasonably responsible slow backoff that gives
things a chance to recover, and which should be very visible in logs.

Fixes #1161.

@brandur brandur force-pushed the brandur-queue-maintainer-start branch 2 times, most recently from fe16c1d to a4e03c6 on March 25, 2026 03:57
initialPeriodicJobs []*riverpilot.PeriodicJob
subServices []startstop.Service
)
if err := func() error {
Contributor Author:
Ended up indenting all of this so that there's a single if err != nil check that we can call stopped() in at the end. Previously, stopped wasn't being invoked in some error branches, which could lead to this service erroring on start and never really stopping :/

for _, service := range m.servicesByName {
if err := service.Start(ctx); err != nil {
startstop.StopAllParallel(maputil.Values(m.servicesByName)...)
stopped()
Contributor Author:

Similarly, make sure stopped() is invoked in the error condition. I really need to go look at this startstop API again to see if we can make this safer.

@brandur brandur requested a review from bgentry March 25, 2026 04:00

bgentry commented Mar 26, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4e03c6506

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


c.testSignals.queueMaintainerStartRetriesExhausted.Signal(struct{}{})

if err := c.clientNotifyBundle.RequestResign(ctx); err != nil {


P1: Abort stale resign requests after leadership loss

tryStartQueueMaintainer runs inside the same goroutine that consumes elector.Listen() notifications, so demotion events are queued until retries/backoff finish. If leadership is lost during those retries, this unconditional RequestResign call can run after another client has already become leader and force that healthy leader to resign, creating unnecessary leadership churn and maintenance interruptions. Please gate this path on current leadership (or cancel retries when a non-leader notification arrives) before sending resign notifications.
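The gate Codex suggests could look roughly like this. Everything here is a hypothetical sketch, not River's actual API: a resign request is only sent if this client still believes it holds leadership, so a retry path that outlived a demotion can't force a newer, healthy leader elsewhere to resign.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// leadership is an illustrative stand-in for the client's leadership
// state. requestResignIfLeader skips the resign request entirely when
// leadership has already been lost.
type leadership struct {
	isLeader atomic.Bool
	resigns  atomic.Int64
}

func (l *leadership) requestResignIfLeader() bool {
	if !l.isLeader.Load() {
		return false // demoted while retrying; don't churn the new leader
	}
	l.resigns.Add(1) // stand-in for RequestResign
	return true
}

func main() {
	var l leadership
	fmt.Println(l.requestResignIfLeader()) // not leader: resign skipped
	l.isLeader.Store(true)
	fmt.Println(l.requestResignIfLeader()) // still leader: resign sent
}
```

A flag check like this still leaves a small window between load and send; the alternative Codex mentions, canceling the retries when a non-leader notification arrives, closes that window at the source.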


Contributor:

Dang that is a subtle one @brandur 🤔

Contributor Author:

Oh man, yes that's a good one.

Should be addressed now I think.

@codex re-review


Contributor Author:

@codex re-review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0bd52005af


client.go Outdated
Comment on lines +1333 to +1334
cancelQueueMaintainerStart()
cancelQueueMaintainerStart = func() {}


P2: Stop queue maintainer before canceling start context

Canceling cancelQueueMaintainerStart() before c.queueMaintainer.Stop() changes the cancellation cause seen by maintenance subservices from startstop.ErrStop to generic context.Canceled. That breaks the reindexer’s stop-specific cleanup path (internal/maintenance/reindexer.go, errors.Is(context.Cause(ctx), startstop.ErrStop)), so if leadership is lost during REINDEX CONCURRENTLY, _ccnew/_ccold artifacts may be left behind and subsequent reindex runs will keep skipping until those artifacts are manually removed.


Contributor Author:

@codex re-review

@brandur brandur force-pushed the brandur-queue-maintainer-start branch from 0bd5200 to 1bb76c7 on March 26, 2026 22:51

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1bb76c76e6


Comment on lines +1377 to +1379
// Stop the queue maintainer to fully reset its state (and any
// sub-services) before retrying.
c.queueMaintainer.Stop()


P1: Skip stale retry before stopping queue maintainer

tryStartQueueMaintainer stops the shared maintainer on every start error, even when this retry context has already been canceled due to a leadership loss. If leadership flips (lose then regain) while an earlier Start call is still blocked (for example in a hook that does not return promptly on cancellation), that older goroutine can return later and execute Stop against the newer leader term’s successfully started maintainer, taking maintenance back offline unexpectedly. Add a cancellation/epoch check immediately after Start returns and before calling Stop so stale retries cannot affect a newer leadership term.
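An epoch guard of the kind Codex suggests could be sketched like this (hypothetical names, not River's actual code): each leadership term bumps an epoch, and a retry goroutine may only stop the maintainer if the epoch it captured when it began is still current, with the check and the stop made atomic under one lock.

```go
package main

import (
	"fmt"
	"sync"
)

// termGuard tracks the current leadership epoch. A stale goroutine
// holding an old epoch becomes a no-op instead of tearing down the
// newer term's maintainer.
type termGuard struct {
	mu    sync.Mutex
	epoch int
	stops int
}

// beginTerm starts a new leadership term and returns its epoch.
func (g *termGuard) beginTerm() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.epoch++
	return g.epoch
}

// stopIfCurrent performs the stop only if epoch is still current. The
// check and the stop happen under one lock, so they're atomic with
// respect to a concurrent beginTerm.
func (g *termGuard) stopIfCurrent(epoch int) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if epoch != g.epoch {
		return false // stale retry from an older term; skip the stop
	}
	g.stops++ // stand-in for queueMaintainer.Stop()
	return true
}

func main() {
	var g termGuard
	old := g.beginTerm()
	cur := g.beginTerm() // leadership lost and regained: a new term
	fmt.Println(g.stopIfCurrent(old), g.stopIfCurrent(cur))
}
```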


Contributor Author:

@codex re-review

@brandur brandur force-pushed the brandur-queue-maintainer-start branch from 1bb76c7 to f855a82 on March 26, 2026 22:59
brandur added 2 commits March 26, 2026 16:01
@brandur brandur force-pushed the brandur-queue-maintainer-start branch from f855a82 to c29f4c1 on March 26, 2026 23:01

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f855a82aed



// Stop the queue maintainer to fully reset its state (and any
// sub-services) before retrying.
c.queueMaintainer.Stop()


P2: Prevent stale start retries from stopping new leadership term

This Stop call can run from an older tryStartQueueMaintainer goroutine after leadership has changed, because cancellation is only checked just before this line and not synchronized with the stop itself. If leadership is lost between that check and this call, then quickly regained, the stale goroutine can stop a queue maintainer that was started by the newer term, leaving maintenance offline while this client is still leader. Add a term/epoch guard or a cancellation check that is atomic with the stop decision to avoid cross-term teardown.




Development

Successfully merging this pull request may close these issues.

Maintenance process startup errors can leave leaders partially functional until leadership change

2 participants