fix(rclone): single config at /etc/rclone-shithub.conf, postgres-readable #21

Merged: mfwolffe merged 3 commits into trunk from fix/rclone-config-shared-path on May 10, 2026

@espadonne
Contributor

Summary

Follow-up to #16. While verifying WAL archiving live I found pg_stat_archiver.failed_count climbing — every archive attempt was erroring with:

Failed to load config file "/root/.config/rclone/rclone.conf": open /root/.config/rclone/rclone.conf: permission denied

Cause: Postgres invokes archive_command as the postgres user. The rclone config was at /root/.config/rclone/rclone.conf, mode 0600, inside /root, which is mode 0700, so postgres cannot even traverse the parent directory. Every other script (backup-daily, sync-cross-region, restore-drill, provision-wal-buckets) runs as root, so none of them ever hit the error.

This PR consolidates to a single config at /etc/rclone-shithub.conf, mode 0640 root:postgres, so both root-run and postgres-run scripts can read it. One file, one rotation point.
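
A minimal sketch of how the new permissions could be checked after a deploy. The helper name check_conf_perms is hypothetical and not part of this PR, and it assumes GNU coreutils stat (the `-c` format flag), as found on a typical Linux droplet:

```shell
# check_conf_perms PATH EXPECTED_MODE [EXPECTED_OWNER:GROUP]
# Hypothetical helper, not one of the scripts in this PR.
# Returns 0 when the file matches, 1 on a mismatch, 2 if stat itself fails.
check_conf_perms() {
  local path="$1" want_mode="$2" want_owner="${3:-}"
  local mode owner

  # %a prints the octal permission bits, e.g. 640 (GNU stat).
  mode="$(stat -c '%a' "$path")" || return 2
  [ "$mode" = "$want_mode" ] || return 1

  # The owner check is optional so the helper also works on scratch files.
  if [ -n "$want_owner" ]; then
    # %U:%G prints owner and group names, e.g. root:postgres.
    owner="$(stat -c '%U:%G' "$path")" || return 2
    [ "$owner" = "$want_owner" ] || return 1
  fi
  return 0
}
```

On the droplet this would be called as `check_conf_perms /etc/rclone-shithub.conf 640 root:postgres`. Running `sudo -u postgres test -r /etc/rclone-shithub.conf` is a more direct check that the postgres user can actually read the file.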

Changes

  • Path rename across all callsites: deploy/postgres/{archive_command,backup-daily,verify-wal-archive}.sh, deploy/spaces/sync-cross-region.sh, deploy/restore-drill/run.sh, deploy/cutover/provision-wal-buckets.sh, deploy/docs-site/sync-to-spaces.sh, plus four runbook docs.
  • deploy/ansible/roles/backup/tasks/main.yml: writes the template to the new path with owner: root, group: postgres, mode: "0640". Drops the now-unused /root/.config/rclone dir task.
  • deploy/postgres/verify-wal-archive.sh: failed_count is cumulative since the last pg_stat_reset_shared('archiver'). The previous "if FAILED_COUNT > 0 then alert" logic would page forever after any historical failure. The new logic flags only when the most recent failure is newer than the most recent success AND falls within the last 10 minutes: genuine ongoing breakage, not history.
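
The verifier rule in the last bullet can be sketched as a pure function over epoch seconds. This illustrates the described logic and is not the actual contents of verify-wal-archive.sh. The function name, the argument order, and the psql wiring in the comments are assumptions:

```shell
# archiver_is_broken LAST_FAILED_EPOCH LAST_ARCHIVED_EPOCH NOW_EPOCH [WINDOW_SECONDS]
# Hypothetical helper. Returns 0 ("broken") only when the latest failure is
# newer than the latest success AND happened within the window (default 600s).
archiver_is_broken() {
  local last_failed="${1:-0}" last_archived="${2:-0}" now="$3" window="${4:-600}"

  # No failure has ever been recorded: nothing to flag.
  [ "$last_failed" -gt 0 ] || return 1

  # The failure predates the latest success: history, not breakage.
  [ "$last_failed" -gt "$last_archived" ] || return 1

  # The failure must also be recent.
  [ $(( now - last_failed )) -lt "$window" ]
}

# One possible way to feed it (assumed wiring; 0 stands in for NULL timestamps):
#   read -r failed archived <<EOF
#   $(psql -AtF' ' -c "SELECT COALESCE(extract(epoch FROM last_failed_time)::bigint, 0),
#                             COALESCE(extract(epoch FROM last_archived_time)::bigint, 0)
#                      FROM pg_stat_archiver")
#   EOF
#   archiver_is_broken "$failed" "$archived" "$(date +%s)" && echo "WAL archiving is failing"
```

Comparing timestamps rather than failed_count means a non-zero historical count no longer pages, which is the behavior change this PR describes.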

Live state (already mirrored ahead of merge to unbreak archiving)

  • /etc/rclone-shithub.conf exists on the droplet, owned root:postgres, mode 0640.
  • /usr/local/bin/shithub-pg-archive patched to point there.
  • pg_stat_reset_shared('archiver') run to clear the 24 historical failures.
  • WAL segments now landing — confirmed 000000010000000000000003 and 4 in spaces-prod:shithub-wal/2026/05/10/.
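
For illustration only, an archive wrapper that reads the shared config and lands segments under a dated prefix might look roughly like this. The real /usr/local/bin/shithub-pg-archive is not shown in this PR, so the function names, the UTC date choice, and the use of `rclone copyto` are all assumptions:

```shell
# wal_dest REMOTE SEGMENT_NAME [YYYY/MM/DD]
# Hypothetical helper: builds the dated destination path for one WAL segment.
# The date defaults to today in UTC (an assumption about the real layout).
wal_dest() {
  local remote="$1" segment="$2" day="${3:-$(date -u +%Y/%m/%d)}"
  printf '%s/%s/%s\n' "$remote" "$day" "$segment"
}

# archive_wal SRC_PATH SEGMENT_NAME
# Hypothetical wrapper body: Postgres would pass %p and %f from archive_command.
# --config points rclone at the shared, postgres-readable file.
archive_wal() {
  rclone --config /etc/rclone-shithub.conf copyto \
    "$1" "$(wal_dest spaces-prod:shithub-wal "$2")"
}
```

A non-zero exit from rclone propagates out of archive_wal, so Postgres would record the attempt as failed and retry the segment later.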

Test plan

  • After merge: re-run ansible (or rely on the live state already matching). The next archive cycle should stay healthy: pg_stat_archiver.failed_count stays at 0 and last_archived_time advances every ~60s.
  • /usr/local/bin/shithub-verify-wal-archive; echo $? prints 0 with no other output, and the heartbeat file is updated.
  • Manually point archive_command at a bad path → verifier flags within 10 min.

espadonne added 3 commits May 10, 2026 00:11
…conf

The previous path was unreachable to the postgres user (Postgres
invokes archive_command as itself, /root is mode 0700). Single
file at the new path serves both root-run scripts (backup, sync,
restore-drill, provisioner) and the postgres-run archive_command.
…cess

failed_count in pg_stat_archiver is cumulative — a non-zero count
is fine if the failures pre-date the most recent success (e.g.,
after fixing a misconfigured archive_command). Only the case
where last_failed_time > last_archived_time AND that failure is
recent (< 10 min) is genuine ongoing breakage.
@mfwolffe mfwolffe merged commit 7a071b2 into trunk May 10, 2026
1 check passed