fix(rclone): single config at /etc/rclone-shithub.conf, postgres-readable by espadonne · Pull Request #21 · tenseleyFlow/shithub

espadonne · 2026-05-10T04:12:02Z

Summary

Follow-up to #16. While verifying WAL archiving live I found pg_stat_archiver.failed_count climbing — every archive attempt was erroring with:

Failed to load config file "/root/.config/rclone/rclone.conf": open /root/.config/rclone/rclone.conf: permission denied

Cause: Postgres invokes archive_command as the postgres user. The rclone config was at /root/.config/rclone/rclone.conf, mode 0600, in a /root dir that's mode 0700 — postgres can't even traverse the parent. Every other script (backup-daily, sync-cross-region, restore-drill, provision-wal-buckets) runs as root and never noticed.
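The traversal problem can be made visible by printing the mode of every directory on the way to the file. This is a minimal sketch, not part of the PR: `walk_perms` is a hypothetical helper name, and `stat -c` assumes GNU coreutils.

```shell
#!/bin/sh
# Print "mode owner:group path" for every component from / down to the target.
# A user who is not the owner needs execute (search) permission on each
# directory in the chain, plus read permission on the file itself.
walk_perms() {
    case $1 in
        /|.) ;;                               # reached the top, stop recursing
        *) walk_perms "$(dirname "$1")" ;;    # print the parents first
    esac
    stat -c '%a %U:%G %n' "$1"
}
```

Run as root, `walk_perms /root/.config/rclone/rclone.conf` lists each component's mode in order. A direct check from the other side is `sudo -u postgres test -r /etc/rclone-shithub.conf && echo readable`.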

This PR consolidates to a single config at /etc/rclone-shithub.conf, mode 0640 root:postgres, so both root-run and postgres-run scripts can read it. One file, one rotation point.
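For reference, the same ownership and mode can be applied by hand with `install`. This is a sketch under assumptions: the PR itself does this through the ansible template task, and `install_shared_conf` is a hypothetical helper name.

```shell
#!/bin/sh
# Copy a config into place with mode 0640: owner read/write, group read-only,
# no access for others.
# Usage: install_shared_conf SRC DEST [OWNER] [GROUP]
# OWNER and GROUP default to root and postgres, matching the PR description.
install_shared_conf() {
    install -o "${3:-root}" -g "${4:-postgres}" -m 0640 "$1" "$2"
}
```

For example, `install_shared_conf rclone.conf /etc/rclone-shithub.conf` run as root. With 0640 root:postgres, root-run scripts read the file as owner and the postgres-run archive_command reads it through the group bit.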

Changes

Path rename across all callsites: deploy/postgres/{archive_command,backup-daily,verify-wal-archive}.sh, deploy/spaces/sync-cross-region.sh, deploy/restore-drill/run.sh, deploy/cutover/provision-wal-buckets.sh, deploy/docs-site/sync-to-spaces.sh, plus four runbook docs.
deploy/ansible/roles/backup/tasks/main.yml: writes the template to the new path with owner: root, group: postgres, mode: "0640". Drops the now-unused /root/.config/rclone dir task.
deploy/postgres/verify-wal-archive.sh: failed_count is cumulative since the last pg_stat_reset_shared('archiver'). The previous "if FAILED_COUNT > 0 then alert" logic would page forever after any historical failure. New logic only flags when the most recent failure is newer than the most recent success AND is within the last 10 min — genuine ongoing breakage, not history.
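The new verifier condition can be sketched as a small predicate over epoch timestamps. This is an illustration of the logic described above, not the script's actual code: the function name, the argument order, and the use of 0 to stand for a NULL timestamp are all assumptions.

```shell
#!/bin/sh
# Decide whether WAL archiving is broken right now.
# Args: LAST_FAILED_EPOCH LAST_ARCHIVED_EPOCH NOW_EPOCH [WINDOW_SECONDS]
# An epoch of 0 stands for NULL (the event never happened).
# Returns 0 (true) when archiving is currently failing, 1 otherwise.
archiver_is_failing() {
    failed=$1
    archived=$2
    now=$3
    window=${4:-600}                           # 10 minutes by default
    [ "$failed" -gt 0 ] || return 1            # no failure on record
    [ "$failed" -gt "$archived" ] || return 1  # a success came after the failure
    [ $((now - failed)) -lt "$window" ]        # the failure is recent
}

# Possible wiring (assumes psql can connect as the calling user):
#   psql -AtX -F ' ' -c "SELECT
#       coalesce(extract(epoch FROM last_failed_time)::bigint, 0),
#       coalesce(extract(epoch FROM last_archived_time)::bigint, 0)
#     FROM pg_stat_archiver"
# then pass the two values plus "$(date +%s)" to archiver_is_failing.
```

The predicate never looks at failed_count, so a non-zero cumulative count left over from an old incident no longer triggers an alert.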

Live state (already mirrored ahead of merge to unbreak archiving)

/etc/rclone-shithub.conf exists on the droplet, owned root:postgres, mode 0640.
/usr/local/bin/shithub-pg-archive patched to point there.
pg_stat_reset_shared('archiver') fired to clear the historical 24 failures.
WAL segments now landing: confirmed 000000010000000000000003 and 000000010000000000000004 in spaces-prod:shithub-wal/2026/05/10/.
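For readers who have not seen the wrapper, an archive script that names its config explicitly might look roughly like the following. This is a hypothetical sketch, not the contents of /usr/local/bin/shithub-pg-archive: the function name, the `RCLONE_CONF` and `WAL_REMOTE` variables, the argument order, and the UTC date-based directory are assumptions inferred from the paths quoted above.

```shell
#!/bin/sh
# Hypothetical archive wrapper: copy one WAL segment to a dated remote directory.
# Usage: archive_wal WAL_PATH WAL_NAME [DAY]
# DAY defaults to today's UTC date formatted as YYYY/MM/DD.
RCLONE_CONF=${RCLONE_CONF:-/etc/rclone-shithub.conf}
WAL_REMOTE=${WAL_REMOTE:-spaces-prod:shithub-wal}

archive_wal() {
    day=${3:-$(date -u +%Y/%m/%d)}
    # --config is passed explicitly so rclone never falls back to the
    # calling user's default config location.
    rclone --config "$RCLONE_CONF" copyto "$1" "$WAL_REMOTE/$day/$2"
}
```

Under these assumptions the script would end with `archive_wal "$1" "$2"` and be called from Postgres with the segment path (%p) and file name (%f) as its two arguments. Listing the day's uploads is then `rclone --config /etc/rclone-shithub.conf lsf spaces-prod:shithub-wal/2026/05/10/`.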

Test plan

After merge: re-run ansible (or rely on the live state already matching). Subsequent archive cycles stay healthy: pg_stat_archiver.failed_count stays at 0 and last_archived_time advances every ~60s.
Running /usr/local/bin/shithub-verify-wal-archive; echo $? prints 0: the verifier exits silently with status 0 and updates the heartbeat file.
Manually point archive_command at a bad path → verifier flags within 10 min.

…conf The previous path was unreachable to the postgres user (Postgres invokes archive_command as itself, /root is mode 0700). Single file at the new path serves both root-run scripts (backup, sync, restore-drill, provisioner) and the postgres-run archive_command.

….config dir task

…cess failed_count in pg_stat_archiver is cumulative — a non-zero count is fine if the failures pre-date the most recent success (e.g., after fixing a misconfigured archive_command). Only the case where last_failed_time > last_archived_time AND that failure is recent (< 10 min) is genuine ongoing breakage.

espadonne added 3 commits May 10, 2026 00:11

ansible(backup): rclone config 0640 root:postgres; drop unused /root/…

211ece3

….config dir task

mfwolffe merged commit 7a071b2 into trunk May 10, 2026
1 check passed

This was referenced May 10, 2026

Production droplet has drifted from ansible-managed state — reconcile or stop pretending #38

Open

audit: read-only droplet-drift checker (issue #38) #40

Merged

mfwolffe merged 3 commits into trunk from fix/rclone-config-shared-path
