fix(pg): enable WAL archiving via conf.d drop-in #16
Merged
Summary
While doing the post-launch hardening sitrep I checked pg_stat_archiver on the live droplet and found archive_mode = off, failed_count = 0, zero segments in spaces-prod:shithub-wal. Continuous WAL archiving has never run. The runbook claimed RPO ≈ one segment; actual RPO is 24 h (the daily logical dump). PITR isn't possible.

Two reasons it never ran:
- The Ansible role templates postgresql.conf to /data/pgdata/postgresql.conf and pins Environment=PGDATA=/data/pgdata on the systemd unit, but pg_ctlcluster (the Debian wrapper that the unit's ExecStart actually invokes) ignores PGDATA and uses Debian's per-cluster config at /etc/postgresql/16/main/postgresql.conf. So the templated file at /data/pgdata/ is never read at runtime, and the live config has no archive settings.
- The shithub-wal and shithub-wal-dr Spaces buckets were never created: the original provision-do.sh only created shithub-backups{,-dr} and shithub-docs. Even with archive_mode = on, the script would 404.

This PR fixes both, end-to-end automated. Zero dashboard work.
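One way to confirm the first cause on a host is to ask the running server which config file it actually loaded and compare that with the templated path. The sketch below assumes psql can connect locally (for example when run as the postgres user); the function name config_mismatch is hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): report when the templated
# postgresql.conf is not the file the running server actually loaded.
config_mismatch() {
  local templated="$1" live
  # SHOW config_file returns the path of the main config file in use.
  live="$(psql -X -At -c 'SHOW config_file')" || return 2
  if [ "$live" != "$templated" ]; then
    echo "templated $templated is not the live config ($live)"
    return 1
  fi
  return 0
}

# Usage on the droplet (assumes local access as the postgres user):
#   config_mismatch /data/pgdata/postgresql.conf
```

A non-zero exit here would point at the situation described above: the file being templated is not the one the server reads.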
Changes
- deploy/postgres/archive_command.sh: adds --s3-no-check-bucket. Same gotcha I fixed in backup-daily.sh and sync-cross-region.sh earlier today: scoped Spaces keys lack GetBucketLocation, so rclone's pre-flight check 403s.
- deploy/ansible/roles/postgres/templates/99_shithub_archive.conf: drop-in with wal_level=replica, archive_mode=on, archive_command, archive_timeout=60. Drop-in instead of full-file rewrite because the Debian default carries dozens of platform-specific settings; smaller blast radius.
- deploy/ansible/roles/postgres/tasks/main.yml: installs the drop-in at /etc/postgresql/16/main/conf.d/ and notifies the restart postgres handler. Also installs the verifier and its hourly cron entry. The previous "postgresql.conf — render" task that writes to /data/pgdata/ is left in place for now (harmless: the file is ignored at runtime; it would matter if we ever migrate the data dir to the block volume).
- deploy/cutover/provision-wal-buckets.sh: operator-laptop one-shot that:
  - […] doctl.
  - […] rclone) and PUT-creates shithub-wal + shithub-wal-dr using inline rclone-config (creds passed via env on the SSH command line, never hit disk).
  - Updates the shithub-prod-app-rw Spaces key's grants to add readwrite on both new buckets via doctl spaces keys update.
  - […] .write-probe object).
- deploy/postgres/verify-wal-archive.sh: hourly health check that asserts (a) archive_mode=on, (b) last_archived_time within 5 min, (c) failed_count == 0, (d) recent segments visible in spaces-prod:shithub-wal. Silent on success (heartbeat to /var/run/shithub-wal-archive.last-clean); on any failure it writes to /var/log/shithub/wal-archive.log AND emits a shithub-wal-archive-tagged systemd journal record (warning priority). Same observability shape as the AIDE check.
- docs/internal/runbooks/backups.md: first-time-setup section pointing at provision-wal-buckets.sh, plus verification queries and common failure modes.

Operator runbook (laptop)
    DEPLOY_HOST=shithub.sh ./deploy/cutover/provision-wal-buckets.sh
    ssh root@shithub.sh 'systemctl restart postgresql@16-main'

That's it. The script is idempotent enough to re-run after a partial failure (it ensures the temp key is cleaned up via an EXIT trap regardless of where it dies).
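The cleanup guarantee mentioned above rests on a shell EXIT trap. The following is a minimal sketch of that pattern, not the script's actual code: create_temp_key and delete_temp_key are hypothetical stand-ins for whatever provision-wal-buckets.sh really calls, and here they only touch a marker file so the sketch runs anywhere.

```shell
#!/usr/bin/env bash
# Sketch of trap-based cleanup. All names below are hypothetical stand-ins.
KEY_MARKER="${KEY_MARKER:-${TMPDIR:-/tmp}/temp-key.$$}"  # stands in for a temporary Spaces key

create_temp_key() { : > "$KEY_MARKER"; }
delete_temp_key() { rm -f "$KEY_MARKER"; }

provision() {
  # Run in a subshell so the EXIT trap fires when this unit of work ends,
  # whether the work succeeds or fails.
  (
    trap delete_temp_key EXIT
    create_temp_key || exit 1
    "$@"   # the real work; its exit status becomes the subshell's status
  )
}

# Usage: provision some_command arg1 arg2
```

Because the trap is registered before the key is created, a re-run after a failed attempt starts from a clean state, which is the property the runbook relies on.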
Test plan
- SELECT * FROM pg_stat_archiver \gx shows last_archived_wal and last_archived_time non-null within ~60 s.
- rclone --config /root/.config/rclone/rclone.conf --s3-no-check-bucket lsf spaces-prod:shithub-wal/ --recursive shows segments accumulating.
- failed_count stays 0 over 10 minutes.
- /usr/local/bin/shithub-verify-wal-archive; echo $? returns 0 silently and updates /var/run/shithub-wal-archive.last-clean.
- […] shithub-wal-archive journal entry at warning priority.

Out of scope (future PR)
The /data/pgdata plan never took effect: the live database is on the root disk (/var/lib/postgresql/16/main on /dev/vda1). Migrating the data dir is invasive (real downtime, real risk) and warrants its own PR with a tested cutover script. Today's setup works fine for MVP scale; the root disk is at 7% utilization.
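To re-check the data directory location and disk utilization before planning that migration, something like the following could be run on the droplet. It is a sketch that assumes local psql access and a POSIX df; the function name data_dir_usage is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the live data directory, the device backing it,
# and how full that filesystem is.
data_dir_usage() {
  local dir
  dir="$(psql -X -At -c 'SHOW data_directory')" || return 2
  # df -P prints one header line, then:
  #   filesystem  blocks  used  available  capacity  mountpoint
  df -P "$dir" | awk -v d="$dir" 'NR == 2 { print d, $1, $5 }'
}

# Usage on the droplet (assumes local access as the postgres user):
#   data_dir_usage
```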