Cron error recovery #544

Merged: 3 commits on Jun 11, 2025
@benjie commented Jun 11, 2025

Currently, if cron is running and the database connection goes away, cron will likely bring down the entire worker instance from the inside on the next minute tickover.

This PR changes that: cron now handles its own errors, retrying with exponential backoff. The trade-off is that errors caused by bugs in cron will be logged but will no longer cause the system to exit. That is the price of being resilient to connection failure.
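
To illustrate the retry behaviour described above, here is a minimal sketch of a log-and-retry loop with exponential backoff. It is not the code merged in this PR: the names `backoffDelayMs`, `retryForever` and `sleep`, along with the 1 second base delay and 60 second cap, are assumptions chosen for the example.

```typescript
// Hypothetical helpers for illustration only; not the implementation in this PR.

/** Delay before retry number `attempt` (0-based), doubling each time up to a cap. */
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60_000): number {
  // Clamp the exponent so very large attempt counts stay finite before the cap applies.
  const exponent = Math.min(attempt, 30);
  return Math.min(maxMs, baseMs * 2 ** exponent);
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Runs `task` until it succeeds. Every failure is logged and followed by a
 * progressively longer wait; errors are never rethrown, so a failing task
 * cannot take down the caller.
 */
async function retryForever<T>(
  task: () => Promise<T>,
  logError: (error: unknown, attempt: number) => void = (error, attempt) =>
    console.error(`attempt ${attempt} failed`, error),
  delayFor: (attempt: number) => number = backoffDelayMs,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      logError(error, attempt);
      await sleep(delayFor(attempt));
    }
  }
}
```

With these defaults the waits are 1s, 2s, 4s, and so on, levelling off at 60s. Because the loop swallows every error, a genuine bug inside `task` would show up only in the logs, which is the trade-off noted above.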

In future we should filter errors by expected connection error codes, but enumerating those is quite tough:

  • ENOENT (unix socket doesn't exist)
  • ECONNREFUSED
  • ETIMEDOUT
  • EHOSTUNREACH
  • EAI_AGAIN
  • ECONNRESET
  • EPIPE
  • errors relating to SSL
  • regular PG errors relating to system administration (e.g. 57P03 `cannot_connect_now` during recovery, 53300 `too_many_connections`, etc.)
  • etc

For now, just retrying infinitely seems like the best approach.
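
For reference, filtering on the error codes listed above might eventually look something like the sketch below. It is hypothetical and deliberately incomplete: the function and constant names are invented for the example, SSL-related errors are not covered, and it assumes the error object exposes a string `code` property (as Node.js system errors do, and as node-postgres does for SQLSTATE codes).

```typescript
// Hypothetical classifier; the code sets are illustrative, not exhaustive.

/** Node.js system error codes from the list above. */
const RETRYABLE_NETWORK_CODES: ReadonlySet<string> = new Set([
  "ENOENT", // unix socket doesn't exist
  "ECONNREFUSED",
  "ETIMEDOUT",
  "EHOSTUNREACH",
  "EAI_AGAIN",
  "ECONNRESET",
  "EPIPE",
]);

/** PostgreSQL SQLSTATE codes from the list above. */
const RETRYABLE_PG_SQLSTATES: ReadonlySet<string> = new Set([
  "57P03", // cannot_connect_now (e.g. server starting up or in recovery)
  "53300", // too_many_connections
]);

/** Returns true if `error` carries one of the known connection-related codes. */
function isLikelyConnectionError(error: unknown): boolean {
  if (typeof error !== "object" || error === null) {
    return false;
  }
  const code = (error as { code?: unknown }).code;
  return (
    typeof code === "string" &&
    (RETRYABLE_NETWORK_CODES.has(code) || RETRYABLE_PG_SQLSTATES.has(code))
  );
}
```

A caller could retry when `isLikelyConnectionError(error)` is true and rethrow otherwise, which would let genuine cron bugs surface again. The difficulty noted above still applies: any connection error missing from these sets would be treated as fatal.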

@benjie merged commit b94bcf8 into main Jun 11, 2025
14 checks passed
@benjie deleted the cron-error-recovery branch June 11, 2025 11:07
@benjie restored the cron-error-recovery branch July 29, 2025 10:53