
Try to start the queue maintainer multiple times with backoff #1184

Open
brandur wants to merge 2 commits into master from brandur-queue-maintainer-start

Conversation


@brandur brandur commented Mar 25, 2026

This one's aimed at addressing #1161. HookPeriodicJobsStart.Start may
return an error that causes the queue maintainer not to start, and there
are a few other intermittent errors that may cause it not to start (say
in the case of a transient DB problem). If this were to occur, the
course of action currently is for the client to just spit an error to
logs and not try any additional remediation, which could have the effect
of leaving the queue maintainer offline for extended periods.

Here, try to address this broadly by allowing the queue maintainer a few
attempts at starting, and with our standard exponential backoff (1s, 2s,
4s, 8s, etc.). In case a queue maintainer fails to start completely, the
client requests resignation and hands leadership off to another client
to see if it can start successfully.

I think this is an okay compromise because in case of a non-transient
fundamental error (say HookPeriodicJobsStart.Start always returns an
error), we don't go into a hot loop that starts hammering things.
Instead, we'll get a reasonably responsible slow backoff that gives
things a chance to recover, and which should be very visible in logs.

Fixes #1161.

@brandur brandur force-pushed the brandur-queue-maintainer-start branch 2 times, most recently from fe16c1d to a4e03c6 on March 25, 2026 03:57
initialPeriodicJobs []*riverpilot.PeriodicJob
subServices []startstop.Service
)
if err := func() error {
Contributor Author:
Ended up indenting all of this so that there's a single if err != nil check that we can call stopped() in at the end. Previously, stopped wasn't being invoked in some error branches, which could lead to this service erroring on start and never really stopping :/

for _, service := range m.servicesByName {
if err := service.Start(ctx); err != nil {
startstop.StopAllParallel(maputil.Values(m.servicesByName)...)
stopped()
Contributor Author:

Similarly, make sure stopped() is invoked in the error condition. I really need to go look at this startstop API again to see if we can make this safer.

@brandur brandur requested a review from bgentry March 25, 2026 04:00

bgentry commented Mar 26, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a4e03c6506

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


c.testSignals.queueMaintainerStartRetriesExhausted.Signal(struct{}{})

if err := c.clientNotifyBundle.RequestResign(ctx); err != nil {


P1: Abort stale resign requests after leadership loss

tryStartQueueMaintainer runs inside the same goroutine that consumes elector.Listen() notifications, so demotion events are queued until retries/backoff finish. If leadership is lost during those retries, this unconditional RequestResign call can run after another client has already become leader and force that healthy leader to resign, creating unnecessary leadership churn and maintenance interruptions. Please gate this path on current leadership (or cancel retries when a non-leader notification arrives) before sending resign notifications.
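The gate Codex suggests could look roughly like this. Everything here is a hypothetical sketch, not River's actual API: a resign request is only sent if this client still believes it holds leadership, so a retry path that outlived a demotion can't force a newer, healthy leader elsewhere to resign.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// leadership is an illustrative stand-in for the client's leadership
// state. requestResignIfLeader skips the resign request entirely when
// leadership has already been lost.
type leadership struct {
	isLeader atomic.Bool
	resigns  atomic.Int64
}

func (l *leadership) requestResignIfLeader() bool {
	if !l.isLeader.Load() {
		return false // demoted while retrying; don't churn the new leader
	}
	l.resigns.Add(1) // stand-in for RequestResign
	return true
}

func main() {
	var l leadership
	fmt.Println(l.requestResignIfLeader()) // not leader: resign skipped
	l.isLeader.Store(true)
	fmt.Println(l.requestResignIfLeader()) // still leader: resign sent
}
```

A flag check like this still leaves a small window between load and send; the alternative Codex mentions, canceling the retries when a non-leader notification arrives, closes that window at the source.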


Contributor:

Dang that is a subtle one @brandur 🤔

Contributor Author:

Oh man, yes that's a good one.

Should be addressed now I think.

@codex re-review


Contributor Author:

@codex re-review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0bd52005af


client.go Outdated
Comment on lines +1333 to +1334
cancelQueueMaintainerStart()
cancelQueueMaintainerStart = func() {}


P2: Stop queue maintainer before canceling start context

Canceling cancelQueueMaintainerStart() before c.queueMaintainer.Stop() changes the cancellation cause seen by maintenance subservices from startstop.ErrStop to generic context.Canceled. That breaks the reindexer’s stop-specific cleanup path (internal/maintenance/reindexer.go, errors.Is(context.Cause(ctx), startstop.ErrStop)), so if leadership is lost during REINDEX CONCURRENTLY, _ccnew/_ccold artifacts may be left behind and subsequent reindex runs will keep skipping until those artifacts are manually removed.


Contributor Author:

@codex re-review

@brandur brandur force-pushed the brandur-queue-maintainer-start branch from 0bd5200 to 1bb76c7 on March 26, 2026 22:51

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1bb76c76e6


Comment on lines +1377 to +1379
// Stop the queue maintainer to fully reset its state (and any
// sub-services) before retrying.
c.queueMaintainer.Stop()


P1: Skip stale retry before stopping queue maintainer

tryStartQueueMaintainer stops the shared maintainer on every start error, even when this retry context has already been canceled due to a leadership loss. If leadership flips (lose then regain) while an earlier Start call is still blocked (for example in a hook that does not return promptly on cancellation), that older goroutine can return later and execute Stop against the newer leader term’s successfully started maintainer, taking maintenance back offline unexpectedly. Add a cancellation/epoch check immediately after Start returns and before calling Stop so stale retries cannot affect a newer leadership term.
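An epoch guard of the kind Codex suggests could be sketched like this (hypothetical names, not River's actual code): each leadership term bumps an epoch, and a retry goroutine may only stop the maintainer if the epoch it captured when it began is still current, with the check and the stop made atomic under one lock.

```go
package main

import (
	"fmt"
	"sync"
)

// termGuard tracks the current leadership epoch. A stale goroutine
// holding an old epoch becomes a no-op instead of tearing down the
// newer term's maintainer.
type termGuard struct {
	mu    sync.Mutex
	epoch int
	stops int
}

// beginTerm starts a new leadership term and returns its epoch.
func (g *termGuard) beginTerm() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.epoch++
	return g.epoch
}

// stopIfCurrent performs the stop only if epoch is still current. The
// check and the stop happen under one lock, so they're atomic with
// respect to a concurrent beginTerm.
func (g *termGuard) stopIfCurrent(epoch int) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if epoch != g.epoch {
		return false // stale retry from an older term; skip the stop
	}
	g.stops++ // stand-in for queueMaintainer.Stop()
	return true
}

func main() {
	var g termGuard
	old := g.beginTerm()
	cur := g.beginTerm() // leadership lost and regained: a new term
	fmt.Println(g.stopIfCurrent(old), g.stopIfCurrent(cur))
}
```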


Contributor Author:

@codex re-review

@brandur brandur force-pushed the brandur-queue-maintainer-start branch from 1bb76c7 to f855a82 on March 26, 2026 22:59
brandur added 2 commits March 26, 2026 16:01
@brandur brandur force-pushed the brandur-queue-maintainer-start branch from f855a82 to c29f4c1 on March 26, 2026 23:01

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f855a82aed



// Stop the queue maintainer to fully reset its state (and any
// sub-services) before retrying.
c.queueMaintainer.Stop()


P2: Prevent stale start retries from stopping new leadership term

This Stop call can run from an older tryStartQueueMaintainer goroutine after leadership has changed, because cancellation is only checked just before this line and not synchronized with the stop itself. If leadership is lost between that check and this call, then quickly regained, the stale goroutine can stop a queue maintainer that was started by the newer term, leaving maintenance offline while this client is still leader. Add a term/epoch guard or a cancellation check that is atomic with the stop decision to avoid cross-term teardown.




Development

Successfully merging this pull request may close these issues.

Maintenance process startup errors can leave leaders partially functional until leadership change

2 participants