Skip to content

fix(s3-datastore, gcs-datastore): don't markSynced from pushChanged no-writeback branch#111

Merged
stack72 merged 1 commit intomainfrom
fix/datastore-pushchanged-no-writeback-marksynced
Apr 28, 2026
Merged

fix(s3-datastore, gcs-datastore): don't markSynced from pushChanged no-writeback branch#111
stack72 merged 1 commit intomainfrom
fix/datastore-pushchanged-no-writeback-marksynced

Conversation

@stack72
Copy link
Copy Markdown
Contributor

@stack72 stack72 commented Apr 28, 2026

Summary

Fixes a silent data-loss bug in @swamp/s3-datastore and @swamp/gcs-datastore (introduced in 2026.04.25.2) where a second repo running swamp datastore setup against a non-empty remote bucket ends up with only the migrated bundle file in its local cache, but a .datastore-sync-state.json sidecar marked localDirty: false against the remote's current generation. The next datastore sync --pull hits the fast path and returns "Pulled 0 files" without downloading the actual data.

Failure mode

In each extension's pushChanged slow path, the no-writeback else branch called markSynced(indexGeneration) when pushed === 0. The premise — "nothing to upload ⇒ local matches remote" — is wrong: the slow walk only checks each LOCAL file against the remote index. Remote-index entries with no local counterpart are never visited, so pushed === 0 is consistent with "local is missing N files that remote has." Marking the sidecar clean in this case let the next pullChanged fast-path past unfetched files.

The only legitimate markSynced call sites are:

Caller Evidence Verdict
pullChanged end-of-walk walk downloaded missing files, sized/mtime'd against index
pushChanged writeback wrote remote index that describes local exactly
pushChanged no-writeback (THIS) only proves no LOCAL-ONLY changes; remote-only files never checked

This PR drops the call from that no-writeback branch.

Cost of removal: at most one extra slow path on the next call (1 HEAD + 1 walk) when the cache happens to be genuinely synced but tryFastPushChanged couldn't prove it. Not a regression on any case that was previously correct.

TOCTOU invariant preserved: no post-walk HEAD/getMetadata in this branch — the branch simply does no sidecar work at all now (swamp-club #168).

Bisect window

Failing UAT run: https://github.com/systeminit/swamp-uat/actions/runs/24994591142

  • tests/cli/datastore/sync_test.ts:30 — sync pull retrieves data written by another repo via single-step workflow [gcs]
  • tests/cli/datastore/sync_test.ts:107 — sync pull with multi-step workflow retrieves all model outputs [gcs]
  • tests/cli/datastore/sync_test.ts:308 — sync fast path invalidates when a second writer pushes [s3]

Tests

Replaced the swamp-club #168 "no-writeback records ETag/generation" tests in both *_cache_sync_test.ts files with #1225 regression tests:

  • Seed remote index with two entries.
  • Set up a local cache containing only one of those entries (zero local-only files — nothing to push).
  • Call pushChanged(), assert pushed === 0 AND that the sidecar file does not exist.
  • Symmetric positive: a follow-up pullChanged() must actually download the missing remote file rather than fast-pathing past it.

Also updated two s3 tests that relied on pushChanged against an empty cache as a "clean baseline" — those now use pullChanged priming (the legitimate baseline-setter via end-of-walk markSynced). The gcs equivalents already used pullChanged priming, so no change there.

Manifests bumped to 2026.04.28.1.

Completes the markDirty hardening line started by f918a6e / swamp #1225.

Test plan

  • deno check clean for both extensions
  • deno lint clean for both extensions
  • deno fmt --check clean for both extensions
  • Full unit/integration suites pass: gcs 66/66, s3 62/62 (includes the new #1225 regression test)
  • End-to-end GCS reproduction with locally-built extension against swamp-uat-gcs-datastore-testing: writer pushes 4 records; reader on a fresh repo runs datastore setup + datastore sync --pull and sees 4 (was 0 pre-fix)
  • Same end-to-end against S3
  • CI auto-publish on merge picks up the manifest version bump

🤖 Generated with Claude Code

…o-writeback branch

The no-writeback branch in pushChanged called markSynced when nothing
needed pushing, on the premise that "nothing to upload ⇒ local matches
remote." That premise is wrong: the slow walk only checks each LOCAL
file against the remote index, so remote-only entries are never
visited. `pushed === 0` is therefore consistent with "local is missing
N files that remote has" — exactly the second-repo `datastore setup`
scenario from production. Marking the sidecar clean here let the next
`pullChanged` fast-path past unfetched files and silently lose data.

Drop the markSynced call from that branch entirely. The only
legitimate sites are `pullChanged` end-of-walk (proven local matches
remote by walk + downloads) and `pushChanged`'s writeback branch
(proven by writing back the index that exactly describes local).

Cost: at most one extra slow path on the next call when the cache is
genuinely synced but `tryFastPushChanged` couldn't prove it. Not a
regression on any case that was previously correct.

Tests: replaced the swamp-club #168 "no-writeback records ETag/
generation" tests with #1225 regression tests that pin the new
contract — seed remote with files local doesn't have, run pushChanged,
assert sidecar does not exist, then assert a follow-up pullChanged
actually downloads the missing files. Updated two s3 tests that used
`pushChanged` against an empty cache to "establish a clean baseline"
to use `pullChanged` priming instead (the legitimate baseline-setter).
The TOCTOU invariant (no post-walk HEAD/getMetadata) is preserved.

Bumped both manifests to 2026.04.28.1.

Completes the markDirty hardening line started by f918a6e / swamp #1225.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Blocking Issues

None.

Summary

This is a clean, correct fix for a real data-loss bug. The root cause analysis in the PR description is accurate: the pushChanged slow-walk only iterates local files against the remote index, so pushed === 0 is consistent with "local is missing N files the remote has." Calling markSynced in that branch let the next pullChanged fast-path past files that were never fetched.

Code changes (both S3 and GCS):

  • The indexGeneration/indexETag capture from pullIndex in pushChanged is correctly dropped since the return value is no longer used — the discarded await is intentional and clean.
  • The else-branch markSynced call is removed. The remaining two legitimate call sites remain unchanged: pullChanged end-of-walk (walk confirms all remote index entries are local) and pushChanged writeback (we just wrote the index that exactly describes local).
  • The indexMutated path is unaffected and still calls markSynced correctly via the writeback branch.

Tests:

  • Both *_cache_sync_test.ts files replace the previous TOCTOU-race regression test with a #1225 regression test that sets up the failing scenario (remote has 2 files, local has 1), calls pushChanged, and asserts the sidecar does not exist.
  • The symmetric positive assertion (a subsequent pullChanged actually downloads the missing file rather than fast-pathing) is the right way to prove the end-to-end invariant holds.
  • Two S3 tests that used pushChanged as a clean-baseline setter are correctly updated to use pullChanged instead — the legitimate path to a verified-clean sidecar.
  • Tests use in-memory mock clients (no live cloud services), finally blocks clean up temp dirs, and no env vars are mutated.

CLAUDE.md compliance: no model/ files touched, no any types, named exports only, no npm dependency changes, manifests bumped to 2026.04.28.1.

Suggestions

  1. The 9–10-line comment block added above return pushed is longer than the one-short-line max called out in CLAUDE.md. The WHY here genuinely is non-obvious (a production incident was needed to surface it), so some comment is warranted — but it could be trimmed to 2–3 lines pointing at the issue tracker reference. Not a blocker given the existing codebase already has similar multi-line comment blocks throughout pullChanged.

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

I traced every code path in both the GCS and S3 pushChanged methods, the markSynced / tryFastPullChanged / tryFastPushChanged fast-path machinery, and the test changes. I attempted to break this across logic/correctness, error handling, concurrency, data integrity, and API contract dimensions.

Critical / High

None found.

Medium

None found.

Low

  1. GCS test uses manual Deno.stat for sidecar check while S3 test uses readSidecar helper (gcs_cache_sync_test.ts:1597-1604): The GCS test file doesn't have a readSidecar helper, so the test does a manual Deno.stat + NotFound check. This is functionally correct but is a minor asymmetry with the S3 test. Not a bug — just a note.

Analysis

The core fix is logically sound. The key asymmetry that caused the bug:

  • pullChanged walks remote index entries → when pulled === 0, every remote entry has been verified to exist locally. markSynced is safe.
  • pushChanged walks local files → when pushed === 0, every local file exists in the remote index. But remote-only entries are never visited, so the converse (every remote entry exists locally) is NOT proven. markSynced was recording a false "in-sync" state.

The scrubIndex / indexMutated path is unaffected — when indexMutated is true, the writeback branch fires, writes the index, and markSynced is called with the PutObject ETag/generation. That path is safe because the written-back index describes exactly what's local.

The two S3 tests that switched from pushChanged to pullChanged as a baseline-setter (s3_cache_sync_test.ts:1270, s3_cache_sync_test.ts:1548) are correct — pullChanged end-of-walk is now the only non-writeback path that legitimately sets the sidecar.

The regression tests are well-constructed: they set up the exact production scenario (remote has 2 files, local has 1), assert the sidecar is NOT written, and include a symmetric positive assertion that pullChanged afterward actually downloads the missing file. This is end-to-end proof that the fix works.

The unused indexGeneration / indexETag capture from pullIndex() is cleanly removed from pushChanged in both providers. The pullChanged method still captures and uses it correctly.

Verdict

PASS — Clean, well-reasoned fix for a real silent data-loss bug. The removal of markSynced from the no-writeback branch is correct, the performance trade-off (one extra slow path) is acknowledged and acceptable, and the tests thoroughly validate both the negative (sidecar not written) and positive (subsequent pull fetches missing files) cases.

@stack72 stack72 merged commit 0cec056 into main Apr 28, 2026
31 checks passed
@stack72 stack72 deleted the fix/datastore-pushchanged-no-writeback-marksynced branch April 28, 2026 08:05
stack72 added a commit that referenced this pull request Apr 28, 2026
…markSynced in pushChanged no-writeback (#112)

PR #111 dropped the markSynced call from pushChanged's no-writeback branch
unconditionally to fix the swamp-club#1225 data-loss scenario (reader's
`datastore setup` against a non-empty bucket would mark the sidecar clean
even though local was missing every remote-index entry, causing the next
pullChanged to fast-path past unfetched files).

That fix was too aggressive: it also broke the legitimate recovery path
(swamp-uat tests/cli/datastore/sync_test.ts:418), where local cache fully
matches remote, the sidecar was wiped — e.g. by an upgrade migration —
and the next no-op push should regenerate it so subsequent fast-path
syncs work.

The two cases are indistinguishable from the per-LOCAL-file walk alone:
`pushed === 0` only proves no local-only diffs to upload. To tell them
apart we need a remote-side check.

Restore the markSynced call in the no-writeback branch but gate it on
`localHasAllRemoteEntries()` — a stat over every entry in the in-memory
remote index, returning true iff each is present locally with matching
size. Cheap (O(N) stats, only on the slow path that already paid for
pullIndex(forceRemote)) and decisive: data-loss scenario stays
unmarked, recovery scenario regenerates the sidecar.

End-to-end verified against real GCS:
- Writer pushes 19 files / 4 records, fresh reader's pull returns 4
  records (data-loss fix preserved).
- Sidecar deleted, no-op push regenerates it with localDirty=false
  (recovery scenario fixed).

Closes the regression introduced by #111. Bumps both extensions to
2026.04.28.2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant