Skip to content

fix(snapshot): protect keeper transaction from idle drops (#9)#10

Merged
valehdba merged 1 commit into
mainfrom
fix/issue-9-snapshot-keeper-resilience
May 13, 2026
Merged

fix(snapshot): protect keeper transaction from idle drops (#9)#10
valehdba merged 1 commit into
mainfrom
fix/issue-9-snapshot-keeper-resilience

Conversation

@valehdba
Copy link
Copy Markdown
Owner

Fixes #9.

Root cause

pgclone.schema() / pgclone.database() export a snapshot on a "keeper" connection and leave it idle-in-transaction for the bulk of the clone — hours, in the original report. Three independent paths can silently terminate that keeper and invalidate every subsequent SET TRANSACTION SNAPSHOT importer with the misleading server message ERROR: invalid snapshot identifier:

  1. Firewall / NAT idle TCP drop on remote source connections (the reporter's case: ~2h 15min clone via host alias, classic perimeter idle timeout). The keeper had no keepalives=* in its conninfo, so the OS default (Linux: 7200s before the first probe) was far too lazy to keep the firewall warm.
  2. idle_in_transaction_session_timeout on the source server — a common production safeguard that will kill the keeper without our consent.
  3. statement_timeout firing on COMMIT.

PostgreSQL emits invalid snapshot identifier both for malformed IDs and for IDs whose backing file has been reaped, which is why the reporter's correctly-formatted PG 18 snapshot ID (000000A6-000C4299-1) was rejected — the keeper had been killed and the file removed.

Fix — four layers, all PG 14–18 compatible

  • pgclone_connect_with_keepalives() — parse conninfo with PQconninfoParse, inject keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=6 only if the user did not set them, connect via PQconnectdbParams. Handles both URI (postgresql://...) and keyword-form strings. ~90s drop detection.
  • pgclone_begin_repeatable_read()SET LOCAL idle_in_transaction_session_timeout = 0 and SET LOCAL statement_timeout = 0 inside the keeper transaction. Both PGC_USERSET; SET LOCAL reverts at COMMIT so pooled connections never leak settings.
  • pgclone_begin_with_imported_snapshot() — when the server returns invalid snapshot identifier, attach a clear errhint naming the most likely causes and the {"consistent": false} opt-out.
  • pgclone_keeper_ping() — cheap SELECT 1 on the keeper before each per-table call and before every sub-phase of pgclone_schema() (FK retry, views, functions, triggers) and pgclone_database() (per-schema loop). Surfaces a clear error naming the keeper instead of letting the next importer waste a COPY before hitting the misleading message.

Files changed

File Change
src/pgclone.c New helpers + harden keeper BEGIN + clearer errhint + 6 ping sites. Version string → 4.3.1.
pgclone.control default_version = '4.3.1'
sql/pgclone--4.3.1.sql New — identical to 4.3.0, no signature changes
sql/pgclone--4.3.0--4.3.1.sql New — empty upgrade with changelog
test/test_snapshot_keeper.sh New regression: sets idle_in_transaction_session_timeout = 1s on source role and runs a 5-table clone; without the fix the keeper is killed mid-loop.
test/run_tests.sh Wires the new test into the Docker runner
.github/workflows/ci.yml Adds the new test step (spaces only, fail-fast: false matrix preserved)
docs/ARCHITECTURE.md New "Snapshot-keeper resilience (v4.3.1)" section

Verification

  • Compiles clean against PG 16 in a fresh build (only the two pre-existing warnings at pgclone.c:3867 / :1089, both outside this diff).
  • The CI matrix (14/15/16/17/18) will exercise the new keeper-resilience test on every supported version.
  • No backward-compat break: existing conninfo strings are preserved (only injected keepalives if the user didn't set any); consistent=>false opt-out is unchanged.

After merge

git checkout main && git pull
git tag -a v4.3.1 -m "v4.3.1 — snapshot-keeper resilience (#9)"
git push origin v4.3.1

Then on the reporter's host:

make clean && make && sudo make install
psql -d itiniris_cl -c "ALTER EXTENSION pgclone UPDATE;"

Long-running pgclone.schema() / pgclone.database() operations export a
snapshot on a "keeper" connection and leave it idle-in-transaction for
the bulk of the clone (hours, in the case of issue #9). Three paths can
silently kill that keeper and invalidate every subsequent SET TRANSACTION
SNAPSHOT importer with the misleading "ERROR: invalid snapshot
identifier":

  1. Firewall / NAT idle TCP drop on remote connections (the user's
     case: ~2h 15min clone via host alias, classic perimeter idle
     timeout).
  2. idle_in_transaction_session_timeout on the source server.
  3. statement_timeout firing on COMMIT.

This change adds four layers of protection, all PG 14-18 compatible:

  * pgclone_connect_with_keepalives(): parse conninfo with
    PQconninfoParse, inject keepalives=1 keepalives_idle=30
    keepalives_interval=10 keepalives_count=6 if the user did not
    set them, connect via PQconnectdbParams. Works for both URI and
    keyword-form conninfo strings. ~90s drop detection; keeps
    perimeter NAT entries warm.

  * pgclone_begin_repeatable_read(): SET LOCAL
    idle_in_transaction_session_timeout = 0 and
    statement_timeout = 0 inside the keeper transaction. Both GUCs
    are PGC_USERSET; SET LOCAL reverts at COMMIT so pooled
    connections don't leak settings.

  * pgclone_begin_with_imported_snapshot(): when the server returns
    "invalid snapshot identifier" -- which it emits both for
    malformed IDs and for IDs whose backing file has been reaped --
    attach a clear errhint pointing at the most common causes and
    the {"consistent": false} opt-out.

  * pgclone_keeper_ping(): cheap SELECT 1 round-trip on the keeper
    before each per-table call and before every sub-phase of
    pgclone_schema() (FK retry, views, functions, triggers) and
    pgclone_database() (per-schema loop). Surfaces a clear error
    naming the keeper instead of letting the next importer hit
    the misleading message after wasting a COPY.

Version bumped to 4.3.1; sql/pgclone--4.3.1.sql identical to 4.3.0
(no signature changes), sql/pgclone--4.3.0--4.3.1.sql is an empty
upgrade with changelog.

Regression test test/test_snapshot_keeper.sh sets
idle_in_transaction_session_timeout=1s on the source role and runs a
5-table schema clone; without the fix the keeper is killed mid-loop
and a later importer fails. Wired into test/run_tests.sh and
.github/workflows/ci.yml so all five PG versions exercise it.

ARCHITECTURE.md documents the three failure paths and the four
mitigation layers.

Closes #9.
@valehdba valehdba merged commit c45cde9 into main May 13, 2026
10 checks passed
@valehdba valehdba deleted the fix/issue-9-snapshot-keeper-resilience branch May 13, 2026 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: ERROR: pgclone: could not import snapshot

1 participant