fix(rclone): single config at /etc/rclone-shithub.conf, postgres-readable #21
Merged
…conf The previous path was unreachable to the postgres user (Postgres invokes archive_command as itself, /root is mode 0700). Single file at the new path serves both root-run scripts (backup, sync, restore-drill, provisioner) and the postgres-run archive_command.
…cess failed_count in pg_stat_archiver is cumulative — a non-zero count is fine if the failures pre-date the most recent success (e.g., after fixing a misconfigured archive_command). Only the case where last_failed_time > last_archived_time AND that failure is recent (< 10 min) is genuine ongoing breakage.
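The rule in the commit message above can be sketched as a small shell function. This is an illustration only, not the actual contents of verify-wal-archive.sh: the function name `archiver_state`, its argument order, and the use of epoch seconds are assumptions made for the example; the 10-minute window comes from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the real verify-wal-archive.sh): classify archiver
# health from three epoch timestamps, in seconds.
#   $1  last_archived_time (empty or 0 if nothing was ever archived)
#   $2  last_failed_time   (empty or 0 if nothing ever failed)
#   $3  current time
archiver_state() {
  local archived="${1:-0}" failed="${2:-0}" now="$3"
  local window=600  # 10 minutes

  # Breakage is "ongoing" only if the latest failure is newer than the
  # latest success AND that failure happened within the window.
  if [ "$failed" -gt "$archived" ] && [ $((now - failed)) -lt "$window" ]; then
    echo broken
    return 1
  fi

  # Either no failure, a failure that pre-dates the last success,
  # or a failure too old to count as ongoing.
  echo ok
}

# One possible way to feed it (assumes psql can connect as a suitable role):
#   read -r archived failed now < <(psql -At -F' ' -c "
#     SELECT COALESCE(extract(epoch FROM last_archived_time)::bigint, 0),
#            COALESCE(extract(epoch FROM last_failed_time)::bigint, 0),
#            extract(epoch FROM now())::bigint
#     FROM pg_stat_archiver")
#   archiver_state "$archived" "$failed" "$now"
```

Note that failed_count is deliberately not consulted: because it is cumulative, only the two timestamps say anything about whether archiving is broken right now.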
This was referenced May 10, 2026
Summary

Follow-up to #16. While verifying WAL archiving live I found pg_stat_archiver.failed_count climbing: every archive attempt was erroring.

Cause: Postgres invokes archive_command as the postgres user. The rclone config was at /root/.config/rclone/rclone.conf, mode 0600, in a /root dir that's mode 0700, so postgres can't even traverse the parent. Every other script (backup-daily, sync-cross-region, restore-drill, provision-wal-buckets) runs as root and never noticed.

This PR consolidates to a single config at /etc/rclone-shithub.conf, mode 0640 root:postgres, so both root-run and postgres-run scripts can read it. One file, one rotation point.

Changes

- deploy/postgres/{archive_command,backup-daily,verify-wal-archive}.sh, deploy/spaces/sync-cross-region.sh, deploy/restore-drill/run.sh, deploy/cutover/provision-wal-buckets.sh, deploy/docs-site/sync-to-spaces.sh, plus four runbook docs.
- deploy/ansible/roles/backup/tasks/main.yml: writes the template to the new path with owner: root, group: postgres, mode: "0640". Drops the now-unused /root/.config/rclone dir task.
- deploy/postgres/verify-wal-archive.sh: failed_count is cumulative since the last pg_stat_reset_shared('archiver'). The previous "if FAILED_COUNT > 0 then alert" logic would page forever after any historical failure. New logic only flags when the most recent failure is newer than the most recent success AND is within the last 10 min: genuine ongoing breakage, not history.

Live state (already mirrored ahead of merge to unbreak archiving)

- /etc/rclone-shithub.conf exists on the droplet, owned root:postgres, mode 0640.
- /usr/local/bin/shithub-pg-archive patched to point there.
- pg_stat_reset_shared('archiver') fired to clear the historical 24 failures.
- WAL segments 000000010000000000000003 and 4 are in spaces-prod:shithub-wal/2026/05/10/.

Test plan

- pg_stat_archiver.failed_count stays 0; last_archived_time increments every ~60s.
- /usr/local/bin/shithub-verify-wal-archive; echo $? returns 0 silently and updates the heartbeat file.
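
As a complement to the test plan, the ownership and mode stated for /etc/rclone-shithub.conf could be asserted with a short check like the one below. This is a sketch and not part of the PR: the function name `check_conf_perms` is hypothetical, and it assumes GNU coreutils `stat` (the `-c` format flag), as found on a typical Linux droplet.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): confirm a file has the expected
# octal mode and group. Assumes GNU stat, which supports -c FORMAT.
#   $1  path to check
#   $2  expected mode, e.g. 640
#   $3  expected group name, e.g. postgres
check_conf_perms() {
  local path="$1" want_mode="$2" want_group="$3"
  local actual

  # %a = octal permission bits, %G = group name of the owner group.
  actual="$(stat -c '%a %G' "$path")" || return 2

  if [ "$actual" = "$want_mode $want_group" ]; then
    echo ok
  else
    echo "mismatch: $actual"
    return 1
  fi
}

# Intended use on the droplet:
#   check_conf_perms /etc/rclone-shithub.conf 640 postgres
# A more direct readability check, run as the postgres user itself:
#   sudo -u postgres test -r /etc/rclone-shithub.conf && echo readable
```

The mode check alone would not have caught the original bug, since the old file's own mode was fine and the 0700 parent directory was what blocked postgres. Testing readability as the postgres user covers both the file and every directory on the way to it.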