fix(pg): enable WAL archiving via conf.d drop-in by espadonne · Pull Request #16 · tenseleyFlow/shithub

espadonne · 2026-05-10T03:34:58Z

Summary

While doing the post-launch hardening sitrep I checked the live droplet: archive_mode = off, pg_stat_archiver shows failed_count = 0, and there are zero segments in spaces-prod:shithub-wal. Continuous WAL archiving has never run. The runbook claimed RPO ≈ one segment; the actual RPO is 24 h (the daily logical dump), and PITR isn't possible.

Two reasons it never ran:

The ansible role writes postgresql.conf to /data/pgdata/postgresql.conf and pins Environment=PGDATA=/data/pgdata on the systemd unit — but pg_ctlcluster (the Debian wrapper that the unit's ExecStart actually invokes) ignores PGDATA and uses Debian's per-cluster config at /etc/postgresql/16/main/postgresql.conf. So the templated file at /data/pgdata/ is never read at runtime, and the live config has no archive settings.
The shithub-wal and shithub-wal-dr Spaces buckets were never created; the original provision-do.sh only created shithub-backups{,-dr} and shithub-docs. Even with archive_mode = on, the archive script's uploads would have failed with a 404.
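One way to confirm the first cause is to ask the running server which config file it actually loaded and compare that with the path the role templates. This is a minimal sketch, assuming psql can connect as the invoking user (for example, run as postgres on the droplet). The function names are illustrative and not part of this PR.

```shell
#!/bin/sh
# Path the ansible role templates, taken from the PR description.
expected_conf="/data/pgdata/postgresql.conf"

# Print a single server setting. -X skips psqlrc, -A and -t strip
# alignment and headers so the output is only the value.
pg_setting() {
  psql -X -At -c "SHOW $1"
}

# Return 0 if the server reads the templated file, 1 if it reads a
# different file, 2 if the server could not be queried.
check_live_config() {
  live_conf="$(pg_setting config_file)" || return 2
  if [ "$live_conf" != "$expected_conf" ]; then
    echo "templated file is not the live config: server reads $live_conf"
    return 1
  fi
  echo "server reads $expected_conf"
}

# Usage (on the droplet):
#   check_live_config
#   pg_setting archive_mode
```

On the droplet described here, the first call would be expected to report the Debian per-cluster path rather than /data/pgdata/.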

This PR fixes both, end-to-end automated. Zero dashboard work.

Changes

deploy/postgres/archive_command.sh — adds --s3-no-check-bucket. Same gotcha I fixed in backup-daily.sh and sync-cross-region.sh earlier today: scoped Spaces keys lack GetBucketLocation, so rclone's pre-flight 403s.
deploy/ansible/roles/postgres/templates/99_shithub_archive.conf — drop-in with wal_level=replica, archive_mode=on, archive_command, archive_timeout=60. Drop-in instead of full-file rewrite because the Debian default carries dozens of platform-specific settings; smaller blast radius.
deploy/ansible/roles/postgres/tasks/main.yml — installs the drop-in at /etc/postgresql/16/main/conf.d/ and notifies the restart postgres handler. Also installs the verifier and its hourly cron entry. The previous postgresql.conf render task that writes to /data/pgdata/ is left in place for now. It is harmless because the file is ignored at runtime, and it would only matter if we ever migrate the data dir to the block volume.
deploy/cutover/provision-wal-buckets.sh — operator-laptop one-shot that:
1. Mints a temp FullAccess Spaces key via doctl.
2. SSHes to the app droplet (which already has rclone) and PUT-creates shithub-wal + shithub-wal-dr using inline rclone-config (creds passed via env on the SSH command line, never hit disk).
3. Deletes the temp FullAccess key (security hygiene — minimize lifetime of a key that can do anything).
4. Extends the existing shithub-prod-app-rw Spaces key's grants to add readwrite on both new buckets via doctl spaces keys update.
5. Verifies the droplet can write to both buckets through its production rclone config (round-trip with a .write-probe object).
deploy/postgres/verify-wal-archive.sh — hourly health check that asserts (a) archive_mode=on, (b) last_archived_time within 5 min, (c) failed_count == 0, (d) recent segments visible in spaces-prod:shithub-wal. Silent on success (heartbeat to /var/run/shithub-wal-archive.last-clean); writes to /var/log/shithub/wal-archive.log AND emits a shithub-wal-archive-tagged systemd journal record (warning priority) on any failure. Same observability shape as the AIDE check.
docs/internal/runbooks/backups.md — first-time-setup section pointing at provision-wal-buckets.sh, plus verification queries and common failure modes.
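For reference, the drop-in listed above amounts to four settings. The sketch below prints a file of that shape. The ARCHIVE_SCRIPT install path and the `%p %f` argument order are assumptions, since the PR description does not show the rendered archive_command; `%p` and `%f` are PostgreSQL's standard placeholders for the segment path and file name.

```shell
#!/bin/sh
# Hypothetical install path for archive_command.sh; not stated in the PR.
ARCHIVE_SCRIPT="${ARCHIVE_SCRIPT:-/usr/local/bin/shithub-archive-command}"

# Print the drop-in contents to stdout so they can be reviewed or redirected.
render_archive_dropin() {
  cat <<EOF
# 99_shithub_archive.conf (sketch): overrides the Debian defaults via conf.d
wal_level = replica
archive_mode = on
archive_command = '${ARCHIVE_SCRIPT} %p %f'
archive_timeout = 60
EOF
}

# Usage:
#   render_archive_dropin > /etc/postgresql/16/main/conf.d/99_shithub_archive.conf
```

Changes to wal_level and archive_mode only take effect after a server restart, which is consistent with the task notifying the restart handler rather than a reload.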

Operator runbook (laptop)

DEPLOY_HOST=shithub.sh ./deploy/cutover/provision-wal-buckets.sh
ssh root@shithub.sh 'systemctl restart postgresql@16-main'

That's it. The script is safe to re-run after a partial failure: an EXIT trap deletes the temp key no matter where the script dies.
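The cleanup guarantee relies on a common shell pattern: register the EXIT trap before minting the key, so any later failure still deletes it. This is a minimal sketch of that pattern, not the script's actual code. create_temp_key and delete_temp_key are hypothetical stand-ins for the real doctl calls, which the PR description does not show.

```shell
#!/bin/sh
# Hypothetical stand-ins for the doctl calls that mint and delete the key.
create_temp_key() { echo "temp-key-123"; }
delete_temp_key() { echo "deleted $1"; }

TEMP_KEY_ID=""

cleanup() {
  # Delete the key only if one was actually minted before the exit.
  if [ -n "$TEMP_KEY_ID" ]; then
    delete_temp_key "$TEMP_KEY_ID"
    TEMP_KEY_ID=""
  fi
}

# Run the given command with a temp key in place. The subshell scopes the
# trap, so cleanup fires when the steps finish or fail partway through.
run_with_temp_key() {
  (
    trap cleanup EXIT                  # registered before the key exists
    TEMP_KEY_ID="$(create_temp_key)"
    "$@"                               # provisioning steps; may fail
  )
}

# Usage (provision_buckets is a hypothetical function holding the steps):
#   run_with_temp_key provision_buckets
```

Because the trap is installed first, there is no window in which the key exists without a registered cleanup.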

Test plan

After provision script + postgres restart: SELECT * FROM pg_stat_archiver \gx shows last_archived_wal and last_archived_time non-null within ~60 s.
rclone --config /root/.config/rclone/rclone.conf --s3-no-check-bucket lsf spaces-prod:shithub-wal/ --recursive shows segments accumulating.
failed_count stays 0 over 10 minutes.
/usr/local/bin/shithub-verify-wal-archive; echo $? returns 0 silently and updates /var/run/shithub-wal-archive.last-clean.
Manually break it (e.g. revoke the bucket grant) → next hourly cron emits a shithub-wal-archive journal entry at warning priority.
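The first three assertions of the hourly verifier can be sketched as below. This is an illustration under stated assumptions, not the contents of verify-wal-archive.sh: the function names are made up, psql is assumed to connect as the invoking user, and the bucket listing, heartbeat file, and log file are omitted.

```shell
#!/bin/sh
MAX_AGE_SECONDS=300   # "within 5 min" from the PR description

# One round trip; psql -At prints pipe-separated fields, e.g. "on|42|0".
# The middle field is empty when no segment has ever been archived.
fetch_archiver_state() {
  psql -X -At -c "SELECT current_setting('archive_mode'),
                         floor(extract(epoch FROM now() - last_archived_time)),
                         failed_count
                  FROM pg_stat_archiver"
}

# Return 0 when healthy; otherwise print the reason and return 1.
evaluate_archiver_state() {
  state="$1"
  mode="${state%%|*}"
  rest="${state#*|}"
  age="${rest%%|*}"
  failed="${rest#*|}"
  if [ "$mode" != "on" ]; then
    echo "archive_mode is '$mode', expected 'on'"; return 1
  fi
  if [ -z "$age" ]; then
    echo "no segment has ever been archived"; return 1
  fi
  if [ "$age" -gt "$MAX_AGE_SECONDS" ]; then
    echo "last archive was ${age}s ago"; return 1
  fi
  if [ "$failed" -ne 0 ]; then
    echo "failed_count is $failed"; return 1
  fi
  return 0
}

# Silent on success; on failure, emit a tagged warning to the journal.
verify_wal_archive() {
  state="$(fetch_archiver_state)" || state="unreachable||0"
  if reason="$(evaluate_archiver_state "$state")"; then
    return 0
  fi
  logger -t shithub-wal-archive -p user.warning "$reason"
  return 1
}
```

Keeping the evaluation separate from the query makes the thresholds checkable without a live database.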

Out of scope (future PR)

DB-on-block-volume. The /data/pgdata plan never took effect — the live database is on the root disk (/var/lib/postgresql/16/main on /dev/vda1). Migrating the data dir is invasive (real downtime, real risk) and warrants its own PR with a tested cutover script. Today's setup works fine for MVP scale; the root disk is at 7% utilization.

espadonne added 7 commits May 9, 2026 23:45

fix(pg-archive): pass --s3-no-check-bucket; scoped Spaces keys lack GetBucketLocation

e43d4a2

ansible(pg): conf.d drop-in to enable WAL archiving

d034668

ansible(pg): install conf.d archive override at the live config path

75e8b4e

docs(backups): WAL archiving first-time setup, verification, common failures

b1e940d

ansible(pg): install verifier + hourly cron

ef7f7ab

ops(pg): hourly WAL-archive health check, journal-tagged on failure

9e3b7ec

cutover: provision WAL buckets + extend key grants via doctl + ssh

8b58604

espadonne force-pushed the fix/wal-archive-enable branch from 7791ef2 to 8b58604 Compare May 10, 2026 03:45

espadonne merged commit f121117 into trunk May 10, 2026
1 check passed

espadonne deleted the fix/wal-archive-enable branch May 10, 2026 03:49

espadonne mentioned this pull request May 10, 2026

fix(rclone): single config at /etc/rclone-shithub.conf, postgres-readable #21

Merged

3 tasks
