feat: per-model datastore locks for concurrent model operations #713
Conversation
Manual testing of compiled binary

Tested the compiled binary against a fresh filesystem datastore to verify per-model locks work end-to-end.

Setup

```shell
mkdir /tmp/swamp-lock-test
cd /tmp/swamp-lock-test
swamp repo init
```

Created two models:

```yaml
# models/command/shell/model-a.yaml
type: "command/shell"
typeVersion: 1
id: "a0000000-0000-4000-a000-000000000001"
name: model-a
version: 1
tags: {}
globalArguments: {}
methods:
  execute:
    arguments:
      run: "sleep 5"
```

```yaml
# models/command/shell/model-b.yaml
type: "command/shell"
typeVersion: 1
id: "b0000000-0000-4000-b000-000000000002"
name: model-b
version: 1
tags: {}
globalArguments: {}
methods:
  execute:
    arguments:
      run: "sleep 5"
```

Concurrent execution test

Launched both model method runs in the background simultaneously:

```shell
swamp model method run model-a execute &
swamp model method run model-b execute &
```

After 2 seconds (while both were still running), checked for lock files: two separate per-model lock files — one per model, no global lock.

Results

Both processes started at the same time and completed at the same time — no blocking.
Replace the single global datastore lock with per-model locks so that operations on different models can run concurrently. Previously, running `model method run` on one model blocked all other model operations, even on completely unrelated models. With per-model locks, only operations on the same model wait for each other.

Closes #706

Co-authored-by: Blake Irvin <blakeirvin@me.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
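As a rough sketch of the per-model locking idea, assuming a filesystem datastore: each model gets its own lock file, and multiple locks are always taken in one globally consistent (sorted) order so two processes can never hold them in opposite orders. The names `modelLockPath` and `lockAcquisitionOrder` are illustrative, not the PR's actual API.

```typescript
interface ModelRef {
  modelType: string;
  modelId: string;
}

// Each model gets its own lock file under the datastore root
// (data/{modelType}/{modelId}/.lock in the PR's layout).
function modelLockPath(dataRoot: string, ref: ModelRef): string {
  return `${dataRoot}/${ref.modelType}/${ref.modelId}/.lock`;
}

// Deduplicate and sort: acquiring locks in a single global order is the
// classic way to rule out deadlock between concurrent multi-lock holders.
function lockAcquisitionOrder(dataRoot: string, refs: ModelRef[]): string[] {
  const paths = refs.map((r) => modelLockPath(dataRoot, r));
  return [...new Set(paths)].sort();
}
```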
Wrap lock-acquiring code paths in try/finally blocks so per-model locks are always released even when method execution, evaluation, or workflow evaluation throws. Also add staleness detection to the global lock polling loop so per-model commands don't hang indefinitely when a structural command crashes without releasing its lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
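A minimal sketch of the try/finally pattern this commit describes, with release errors swallowed so the original error is preserved. `withModelLock` and its parameter types are hypothetical stand-ins, not the PR's actual helpers.

```typescript
type ReleaseFn = () => Promise<void>;

async function withModelLock<T>(
  acquire: () => Promise<ReleaseFn>,
  body: () => Promise<T>,
): Promise<T> {
  const release = await acquire();
  try {
    return await body(); // may throw (method execution, evaluation, ...)
  } finally {
    try {
      await release(); // always attempted, even when body() threw
    } catch {
      // swallow release errors so the caller sees the original failure
    }
  }
}
```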
1. Global lock now waits for held per-model locks before acquiring, preventing races between structural commands and in-progress per-model operations.
2. S3 sync failures (pull and push) now throw instead of silently continuing with stale/lost data.
3. workflow_run.ts: `flushModelLocks()` in catch block wrapped in try/catch so the original error is preserved if lock release fails.
4. model_method_run.ts: `flushModelLocks()` in finally block wrapped in try/catch so the original error is preserved if lock release fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Move per-model lock release AFTER S3 push completes. Previously locks were released before push, so if push failed another process could acquire the lock, pull stale S3 data, and overwrite local changes.
2. Re-check global lock after acquiring each per-model lock. If a structural command acquired the global lock between the initial check and per-model acquisition, release all per-model locks, wait for the global lock, and retry from scratch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. `pullChanged()` and `pullChangedForModel()` now track failed downloads and throw with the list of failed files instead of returning a partial count. Prevents operating on incomplete/stale data.
2. `pushChanged()` now tracks failed uploads and throws instead of silently dropping files. Prevents data loss when S3 uploads fail.
3. Global lock path in `registerDatastoreSyncNamed()` now re-throws S3 pull errors instead of logging a warning and continuing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
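The "track failures, then throw with the full list" shape can be sketched as below, assuming a hypothetical per-file `downloadFile` callback; `pullChangedSketch` is illustrative, not the PR's real `pullChanged()`.

```typescript
async function pullChangedSketch(
  files: string[],
  downloadFile: (rel: string) => Promise<void>,
): Promise<number> {
  const failed: string[] = [];
  let pulled = 0;
  for (const rel of files) {
    try {
      await downloadFile(rel);
      pulled++;
    } catch {
      failed.push(rel); // keep going so the final error names every failed file
    }
  }
  if (failed.length > 0) {
    // Throwing (instead of returning a partial count) prevents callers from
    // proceeding on incomplete or stale data.
    throw new Error(`S3 pull failed for ${failed.length} file(s): ${failed.join(", ")}`);
  }
  return pulled;
}
```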
Wrap the flush function body in try/finally so per-model locks are released regardless of whether the S3 push succeeds. Previously, a push failure would throw before reaching lock release, leaving locks held until TTL expiry and blocking other operations on those models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
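The flush ordering this commit fixes can be sketched as: push while the locks are still held, then release in `finally` so a push failure still frees the locks. `flushSketch` and its callbacks are hypothetical.

```typescript
async function flushSketch(
  pushChanged: () => Promise<void>, // S3 push, performed while locks are held
  releaseLocks: () => Promise<void>, // per-model lock release
): Promise<void> {
  try {
    await pushChanged(); // a failure here propagates to the caller...
  } finally {
    await releaseLocks(); // ...but the locks are released either way
  }
}
```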
Adversarial Review
I've traced through all code paths, edge cases, and failure scenarios in this per-model locking PR. The implementation is well-designed with appropriate safeguards.
Medium
- src/cli/repo_context.ts:468-474 — waitForPerModelLocks unbounded wait
  - The `while(true)` loop waits indefinitely for per-model locks to be released before acquiring the global lock.
  - Breaking scenario: If another process holds a per-model lock with an extremely large custom TTL (e.g., 24 hours configured via `LockOptions`), or if the staleness check has a clock skew issue (system clock jumps backward), this function blocks indefinitely.
  - Mitigation: The staleness check at line 448 (`Date.now() - acquiredAt <= info.ttlMs`) means stale locks are ignored. Default TTL is 30s. This is acceptable given typical usage.
  - Suggested improvement (optional): Add a configurable maximum wait time with fallback to warning + proceed.
- src/cli/repo_context.ts:592 — Recursive retry without depth limit
  - If `acquireModelLocks` detects a global lock during per-model acquisition, it releases everything and recursively calls itself.
  - Breaking scenario: Pathological case where a structural command repeatedly acquires/releases the global lock at exactly the wrong timing, causing unbounded recursion.
  - Mitigation: The wait loop at lines 577-589 blocks until the global lock is released or stale, making the pathological case extremely unlikely in practice.
  - Suggested improvement (optional): Add max retry count (e.g., 10) before throwing.
- src/cli/repo_context.ts:608-617 — Partial lock acquisition on S3 pull failure
  - If S3 `pullChangedForModel` fails mid-way through the lock acquisition loop, some per-model locks are held but the returned flush function is never constructed.
  - Breaking scenario: Process A acquires lock for model/X, S3 pull for model/Y fails, error propagates up. Locks for model/X remain in the global `entries` map until CLI exit or SIGINT.
  - Why it's OK: The locks ARE registered in the coordinator's `entries` map (line 558), so `flushDatastoreSync` at CLI exit releases them. The SIGINT handler also covers this. TTL expiration handles truly orphaned locks. This is acceptable latency, not a leak.
- src/cli/commands/datastore_lock.ts:128 — Per-model lock status only works for filesystem
  - `scanModelLocks` only scans filesystem datastores. S3 per-model locks are not shown in `lock status`.
  - Impact: Users with S3 datastores cannot inspect per-model lock state via the breakglass command.
  - Why acceptable: S3 locks can be inspected directly via AWS console/CLI. This is a feature gap, not a bug.
- src/cli/commands/workflow_run.ts:278-283 — No locks when model references don't resolve
  - If `modelRefs` is non-empty but `findDefinitionByIdOrName` returns null for all (models don't exist), `resolvedModels` is empty and no locks are acquired.
  - Why acceptable: The workflow will fail at runtime when the model_method task executes, surfacing a clear error. Acquiring no locks is the correct behavior for this error case.
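Two of the Medium items above lend themselves to short sketches: the TTL staleness test from the first mitigation, and the bounded-retry variant suggested for the second item. Field names follow the review (`acquiredAt`, `ttlMs`); the function names are illustrative, not the PR's API.

```typescript
interface LockInfo {
  acquiredAt: number; // epoch ms when the lock was taken
  ttlMs: number;      // lock time-to-live
}

// A lock is only honored while its age is within its TTL. Note that a
// negative age (system clock jumped backward) still counts as live, which
// is exactly the clock-skew hazard the review points out.
function isLive(info: LockInfo, now: number = Date.now()): boolean {
  return now - info.acquiredAt <= info.ttlMs;
}

// Bounded retry instead of unbounded recursion: tryAcquire is a hypothetical
// callback that returns false when a global lock interfered and a full retry
// (release everything, wait, reacquire) is needed.
async function acquireWithRetryLimit(
  tryAcquire: () => Promise<boolean>,
  maxAttempts = 10,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await tryAcquire()) return;
  }
  throw new Error(`gave up acquiring per-model locks after ${maxAttempts} attempts`);
}
```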
Low
- src/domain/workflows/model_reference_extractor.ts:83-93 — Missing nested workflow silently skipped
  - If a workflow references a nested workflow that doesn't exist, `nestedWorkflow` is null and its model references aren't extracted.
  - Impact: Potential under-locking if the nested workflow is created after extraction but before execution.
  - Why acceptable: The workflow will fail at runtime anyway since the nested workflow doesn't exist. This is a deferred error, not a correctness issue.
- Potential race between S3 index mutation and persistence
  - `pullChangedForModel` mutates `this.index.entries[rel].localMtime` in memory (line 271) but doesn't persist immediately.
  - Why acceptable: The index is a best-effort optimization. Worst case is redundant file transfers, not data corruption.
Verified Correctness

- Deadlock prevention: Lock ordering via sorted acquisition (lines 526-529) is correct.
- TOCTOU mitigation: Re-checking the global lock after each per-model acquisition (lines 564-593) minimizes but doesn't eliminate the race. The remaining window is microseconds. Acceptable.
- Path traversal protection: `assertSafePath` validates all paths derived from S3 index entries.
- SIGINT cleanup: Handler at lines 66-76 releases all held locks with a 5s timeout before force exit.
- Lock heartbeat management: Recursive retry properly releases locks via `flushDatastoreSyncNamed`, which stops heartbeats.
- S3 push atomicity: Global lock held during push (lines 628-649) prevents concurrent pushers from clobbering each other.
Verdict
PASS — The per-model locking implementation is solid. The medium-severity items are edge cases with appropriate mitigations (TTL expiration, CLI-level cleanup, SIGINT handler). No blocking issues found.
it's been passed but not updated
Summary
Replaces the single global datastore lock with per-model locks so that operations on different models can run concurrently. This fixes the core issue where running `model method run` on one model blocked all other model operations, even on completely unrelated models.

- Per-model lock files at `data/{modelType}/{modelId}/.lock` — operations on different models no longer block each other
- Sync coordinator refactored to a `Map<string, SyncEntry>` supporting multiple concurrent locks
- Model-scoped S3 sync (`pullChangedForModel`) so S3 datastores only pull/push the relevant model's files
- Breakglass commands (`lock status`, `lock release --model`) to show and manage per-model locks

Closes #706
Why per-model locks?
The global lock was designed for simplicity, but it creates an O(n) bottleneck for fleet operations. When managing 100+ VMs where each method call takes 30-90 seconds, a single global lock means a long-running `healthCheck` on `prod-vm-1` blocks a quick `checkService` on `dev-vm-2` — the second command waits or times out. With per-model locks, each operation takes only its own model's lock (`data/command/shell/prod-vm-1/.lock` and `data/command/shell/dev-vm-2/.lock`).

Structural commands (`data gc`, `repo init`, model create/delete) still use the global lock, and per-model lock acquisition checks for a held global lock first — so data integrity is preserved.

Impact on users

- `model method run` on different models runs concurrently (the primary win)
- `workflow run` acquires only the locks for models referenced in the workflow
- `model evaluate` and `workflow evaluate` use per-model locks for single-model/workflow evaluation
What changed

- `datastore_sync_coordinator.ts` — `Map<string, SyncEntry>` with named lock support
- `repo_context.ts` — `requireInitializedRepoUnlocked()`, `createModelLock()`, `acquireModelLocks()`
- `s3_cache_sync.ts` — `pullChangedForModel()` for model-scoped S3 sync
- `model_method_run.ts`
- `workflow_run.ts`
- `model_evaluate.ts` — `--all` keeps global
- `workflow_evaluate.ts` — `--all` keeps global
- `datastore_lock.ts` — `lock status` scans per-model locks; `lock release --model` added
- `datastore_output.ts`
- `model_reference_extractor.ts`
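A hypothetical shape for the `Map<string, SyncEntry>` refactor listed above — a coordinator that tracks multiple named lock entries and can flush one (a single model) or all (CLI exit / SIGINT). The class and method names here are sketches, not the file's actual exports.

```typescript
interface SyncEntry {
  lockPath: string;
  flush: () => Promise<void>; // push + release for this one entry
}

class SyncCoordinator {
  private entries = new Map<string, SyncEntry>();

  register(name: string, entry: SyncEntry): void {
    this.entries.set(name, entry);
  }

  // Flush a single named entry (e.g., one model's lock); a no-op if the
  // entry was already flushed, so double-release is harmless.
  async flushNamed(name: string): Promise<void> {
    const entry = this.entries.get(name);
    if (!entry) return;
    this.entries.delete(name);
    await entry.flush();
  }

  // Flush everything still held, for CLI exit or SIGINT cleanup.
  async flushAll(): Promise<void> {
    for (const name of [...this.entries.keys()]) {
      await this.flushNamed(name);
    }
  }
}
```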
Test plan

- `deno check`, `deno lint`, `deno fmt` clean
- Manual: `model method run` on different models both proceed without blocking, each with its own `.lock` file

🤖 Generated with Claude Code