fix(pg): enable WAL archiving via conf.d drop-in #16

Merged: espadonne merged 7 commits into trunk from fix/wal-archive-enable on May 10, 2026

@espadonne espadonne commented May 10, 2026

Summary

While doing the post-launch hardening sitrep I checked pg_stat_archiver on the live droplet and found archive_mode = off, failed_count = 0, and zero segments in spaces-prod:shithub-wal. Continuous WAL archiving has never run. The runbook claimed RPO ≈ one segment; the actual RPO is up to 24 h (the interval between daily logical dumps), and PITR isn't possible.

Two reasons it never ran:

  1. The ansible role writes postgresql.conf to /data/pgdata/postgresql.conf and pins Environment=PGDATA=/data/pgdata on the systemd unit — but pg_ctlcluster (the Debian wrapper that the unit's ExecStart actually invokes) ignores PGDATA and uses Debian's per-cluster config at /etc/postgresql/16/main/postgresql.conf. So the templated file at /data/pgdata/ is never read at runtime, and the live config has no archive settings.
  2. The shithub-wal and shithub-wal-dr Spaces buckets were never created — the original provision-do.sh only created shithub-backups{,-dr} and shithub-docs. Even with archive_mode = on, the script would 404.
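One way to confirm the first cause on a running host is to ask the server which file it actually loaded and compare that with the path the role templates. The following is a minimal sketch, assuming `psql` can connect locally; the helper name `config_is_live` is hypothetical and not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): succeed only if the config file
# the running server reports is the one we expect it to read.
config_is_live() {
  expected=$1
  # SHOW config_file returns the path of the postgresql.conf actually loaded.
  live=$(psql -XAtc 'SHOW config_file') || return 2
  if [ "$live" = "$expected" ]; then
    return 0
  fi
  echo "server reads $live, not $expected" >&2
  return 1
}

# Example (run on the database host as a user that can connect):
#   config_is_live /data/pgdata/postgresql.conf
```

On the droplet described here, the example call would be expected to fail and name /etc/postgresql/16/main/postgresql.conf, which is the symptom behind reason 1.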

This PR fixes both, end-to-end automated. Zero dashboard work.

Changes

  • deploy/postgres/archive_command.sh — adds --s3-no-check-bucket. Same gotcha I fixed in backup-daily.sh and sync-cross-region.sh earlier today: scoped Spaces keys lack GetBucketLocation, so rclone's pre-flight 403s.
  • deploy/ansible/roles/postgres/templates/99_shithub_archive.conf — drop-in with wal_level=replica, archive_mode=on, archive_command, archive_timeout=60. Drop-in instead of full-file rewrite because the Debian default carries dozens of platform-specific settings; smaller blast radius.
  • deploy/ansible/roles/postgres/tasks/main.yml — installs the drop-in at /etc/postgresql/16/main/conf.d/, notifies the restart postgres handler. Also installs the verifier + hourly cron entry. The previous postgresql.conf — render task that writes to /data/pgdata/ is left in place for now (harmless — file is ignored at runtime; would matter if we ever migrate the data dir to the block volume).
  • deploy/cutover/provision-wal-buckets.sh — operator-laptop one-shot that:
    1. Mints a temp FullAccess Spaces key via doctl.
    2. SSHes to the app droplet (which already has rclone) and PUT-creates shithub-wal + shithub-wal-dr using inline rclone-config (creds passed via env on the SSH command line, never hit disk).
    3. Deletes the temp FullAccess key (security hygiene — minimize lifetime of a key that can do anything).
    4. Extends the existing shithub-prod-app-rw Spaces key's grants to add readwrite on both new buckets via doctl spaces keys update.
    5. Verifies the droplet can write to both buckets through its production rclone config (round-trip with a .write-probe object).
  • deploy/postgres/verify-wal-archive.sh — hourly health check that asserts (a) archive_mode=on, (b) last_archived_time within 5 min, (c) failed_count == 0, (d) recent segments visible in spaces-prod:shithub-wal. Silent on success (heartbeat to /var/run/shithub-wal-archive.last-clean); writes to /var/log/shithub/wal-archive.log AND emits a shithub-wal-archive-tagged systemd journal record (warning priority) on any failure. Same observability shape as the AIDE check.
  • docs/internal/runbooks/backups.md — first-time-setup section pointing at provision-wal-buckets.sh, plus verification queries and common failure modes.
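For orientation, the drop-in listed under Changes might resemble the output of the sketch below. This is an illustration, not the contents of 99_shithub_archive.conf: the function name `render_archive_dropin` and the script path passed to it are assumptions, and the `%p` / `%f` arguments are PostgreSQL's standard archive_command placeholders for the segment path and file name.

```shell
#!/bin/sh
# Hypothetical sketch: print a conf.d drop-in carrying the four settings
# named in this PR. The archive script path is supplied by the caller.
render_archive_dropin() {
  script=$1
  cat <<EOF
# Illustrative drop-in; the real template lives in the ansible role.
wal_level = replica
archive_mode = on
archive_command = '$script %p %f'
archive_timeout = 60
EOF
}

# Example (the install path is an assumption):
#   render_archive_dropin /usr/local/bin/archive_command.sh
```

Changing archive_mode or wal_level only takes effect after a server restart, which is why the task notifies the restart handler rather than a reload.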

Operator runbook (laptop)

DEPLOY_HOST=shithub.sh ./deploy/cutover/provision-wal-buckets.sh
ssh root@shithub.sh 'systemctl restart postgresql@16-main'

That's it. The script is idempotent enough to re-run after a partial failure: an EXIT trap deletes the temp key regardless of where the script dies.
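The cleanup guarantee mentioned above is the usual shell EXIT-trap pattern. The sketch below shows that pattern in generic form; it is not the code from provision-wal-buckets.sh, and `mint_key`, `delete_key`, and `create_buckets` are hypothetical stand-ins for the doctl and rclone calls, whose exact invocations are not shown in this PR.

```shell
#!/bin/sh
# Generic EXIT-trap pattern: once the trap is installed, cleanup runs when
# the shell exits, whether the later steps succeed or fail.
TEMP_KEY_ID=""

cleanup() {
  if [ -n "$TEMP_KEY_ID" ]; then
    delete_key "$TEMP_KEY_ID"   # hypothetical wrapper around key deletion
    TEMP_KEY_ID=""
  fi
}

provision() {
  trap cleanup EXIT
  TEMP_KEY_ID=$(mint_key) || return 1   # hypothetical; prints the new key id
  create_buckets "$TEMP_KEY_ID"         # hypothetical; may fail part-way
}
```

Clearing `TEMP_KEY_ID` inside `cleanup` keeps a second invocation from attempting to delete the same key twice.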

Test plan

  • After provision script + postgres restart: SELECT * FROM pg_stat_archiver \gx shows last_archived_wal and last_archived_time non-null within ~60 s.
  • rclone --config /root/.config/rclone/rclone.conf --s3-no-check-bucket lsf spaces-prod:shithub-wal/ --recursive shows segments accumulating.
  • failed_count stays 0 over 10 minutes.
  • /usr/local/bin/shithub-verify-wal-archive; echo $? returns 0 silently and updates /var/run/shithub-wal-archive.last-clean.
  • Manually break it (e.g. revoke the bucket grant) → next hourly cron emits a shithub-wal-archive journal entry at warning priority.
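The verifier's first three assertions (archive_mode, archive recency, failed_count) could be expressed roughly as follows. This is a sketch under stated assumptions, not the contents of verify-wal-archive.sh: the function name `check_archiver` is hypothetical, the values are passed in as arguments so the logic can be run without a database, and check (d) against the bucket is left out.

```shell
#!/bin/sh
# Hypothetical sketch of checks (a)-(c).
#   $1 archive_mode ("on"/"off")   $2 seconds since last_archived_time
#   $3 failed_count                $4 max allowed age in seconds
# Prints one line per failed check and returns non-zero if any failed.
check_archiver() {
  mode=$1; age=$2; failed=$3; max_age=$4
  rc=0
  if [ "$mode" != "on" ]; then
    echo "archive_mode is $mode"; rc=1
  fi
  # An empty age means last_archived_time is NULL: nothing archived yet.
  if [ -z "$age" ] || [ "$age" -gt "$max_age" ]; then
    echo "last archive too old or missing (age=${age:-none}s)"; rc=1
  fi
  if [ "$failed" -ne 0 ]; then
    echo "failed_count is $failed"; rc=1
  fi
  return $rc
}

# Possible wiring (assumes local psql access and the logger utility):
#   mode=$(psql -XAtc 'SHOW archive_mode')
#   age=$(psql -XAtc "SELECT floor(extract(epoch FROM now() - last_archived_time))::int FROM pg_stat_archiver")
#   failed=$(psql -XAtc 'SELECT failed_count FROM pg_stat_archiver')
#   msg=$(check_archiver "$mode" "$age" "$failed" 300) ||
#     logger -t shithub-wal-archive -p user.warning "$msg"
```

Keeping the decision logic separate from the queries makes the failure paths easy to exercise by hand, for example `check_archiver off 30 0 300`.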

Out of scope (future PR)

  • DB-on-block-volume. The /data/pgdata plan never took effect — the live database is on the root disk (/var/lib/postgresql/16/main on /dev/vda1). Migrating the data dir is invasive (real downtime, real risk) and warrants its own PR with a tested cutover script. Today's setup works fine for MVP scale; the root disk is at 7% utilization.

@espadonne espadonne force-pushed the fix/wal-archive-enable branch from 7791ef2 to 8b58604 May 10, 2026 03:45
@espadonne espadonne merged commit f121117 into trunk May 10, 2026
1 check passed
@espadonne espadonne deleted the fix/wal-archive-enable branch May 10, 2026 03:49